Multiindex merge in dask - python

I am trying to inner merge two dask dataframes based on two IDs, namely doi and pmid.
The datasets look like this (only the head, feel free to modify the doi and pmid to construct a MWE):
dd_papers_all:
{'cited_by_count': {0: nan, 1: nan, 2: 9.0, 3: nan, 4: 30.0},
'cited_by_url': {0: 'None',
1: 'None',
2: "['W1968224982', 'W1977435724', 'W2003814720', 'W2006453929', 'W2015063028', 'W2139614344']",
3: 'None',
4: "['W181218938', 'W1969520123', 'W1970043627', 'W1977525191', 'W2006834484', 'W2057850214', 'W2062554850', 'W2070252209', 'W2098074569', 'W2123616561', 'W2150154625', 'W2408116868']"},
'authors': {0: "['Walczak M', 'Pawlaczyk J']",
1: "['Ioan Oliver Avram']",
2: "['T.M. Dmitrieva', 'T.P. Eremeeva', 'G.I. Alatortseva', 'Vadim I. Agol']",
3: "['Djurdjina Ružić', 'Tatjana Vujović', 'Gabriela Libiaková', 'Radosav Cerović', 'Alena Gajdošová']",
4: "['M. Harris']"},
'institutions': {0: '[]',
1: '[]',
2: "['Lomonosov Moscow State University', 'USSR Academy of Medical Sciences', 'Lomonosov Moscow State University', 'USSR Academy of Medical Sciences']",
3: "['Fruit Research Institute', 'Fruit Research Institute', 'Institute of Plant Genetics and Biotechnology', 'Fruit Research Institute', 'Institute of Plant Genetics and Biotechnology']",
4: "['Durban University of Technology']"},
'paper_id': {0: 'W155261221',
1: 'W145424619',
2: 'W1482328891',
3: 'W1581876373',
4: 'W1978891149'},
'pub_year_col': {0: 1969, 1: 2010, 2: 1980, 3: 2012, 4: 2008},
'level_0': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'level_1': {0: "['Ophthalmology', 'Pediatrics', 'Internal medicine', 'Endocrinology', 'Anatomy', 'Developmental psychology']",
1: "['Risk analysis (engineering)', 'Manufacturing engineering', 'Industrial engineering', 'Operations research', 'Mechanical engineering', 'Epistemology', 'Climatology', 'Macroeconomics', 'Operating system']",
2: "['Virology', 'Cell biology', 'Biochemistry', 'Quantum mechanics']",
3: "['Horticulture', 'Botany', 'Biochemistry']",
4: "['Pedagogy', 'Mathematics education', 'Paleontology', 'Social science', 'Neuroscience', 'Programming language', 'Quantum mechanics']"},
'level_2': {0: "['Craniopharyngioma', 'Fundus (uterus)', 'Girl', 'Ventricle', 'Fourth ventricle']",
1: "['Process (computing)', 'Production (economics)', 'Machine tool', 'Quality (philosophy)', 'Productivity', 'Multiple-criteria decision analysis', 'Machining', 'Forcing (mathematics)']",
2: "['Mechanism (biology)', 'Replication (statistics)', 'Virus', 'Gene']",
3: "['Vaccinium', 'In vitro']",
4: "['Negotiation', 'Reflective practice', 'Construct (python library)', 'Action research', 'Context (archaeology)', 'Reflective writing', 'Perception', 'Power (physics)']"},
'doi': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'pmid': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'mag': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
and dd_green_papers_frontier:
{'paperid': {0: 2006817976,
1: 2006817976,
2: 1972698438,
3: 1968223008,
4: 2149313415},
'uspto': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'doi': {0: '10.1016/J.AB.2003.07.015',
1: '10.1016/J.AB.2003.07.015',
2: '10.1007/S002170100404',
3: '10.1007/S002170100336',
4: '10.3324/%X'},
'pmid': {0: 14656521.0, 1: 14656521.0, 2: nan, 3: nan, 4: 12414351.0},
'publn_nr_x': {0: 2693, 1: 2693, 2: 2693, 3: 2693, 4: 2715},
'paperyear': {0: 2003.0, 1: 2003.0, 2: 2001.0, 3: 2001.0, 4: 2002.0},
'papertitle': {0: 'development of melting temperature based sybr green i polymerase chain reaction methods for multiplex genetically modified organism detection',
1: 'development of melting temperature based sybr green i polymerase chain reaction methods for multiplex genetically modified organism detection',
2: 'event specific detection of roundup ready soya using two different real time pcr detection chemistries',
3: 'characterisation of the roundup ready soybean insert',
4: 'hepatitis c virus infection in a hematology ward evidence for nosocomial transmission and impact on hematologic disease outcome'},
'magfieldid': {0: 149629.0,
1: 149629.0,
2: 143660080.0,
3: 40767140.0,
4: 2780572000.0},
'oecd_field': {0: '2. Engineering and Technology',
1: '2. Engineering and Technology',
2: '1. Natural Sciences',
3: '1. Natural Sciences',
4: '3. Medical and Health Sciences'},
'oecd_subfield': {0: '2.11 Other engineering and technologies',
1: '2.11 Other engineering and technologies',
2: '1.06 Biological sciences',
3: '1.06 Biological sciences',
4: '3.02 Clinical medicine'},
'wosfield': {0: 'Food Science & Technology',
1: 'Food Science & Technology',
2: 'Biochemical Research Methods',
3: 'Biochemistry & Molecular Biology',
4: 'Hematology'},
'author': {0: 2083283000.0, 1: 2808753700.0, 2: nan, 3: 2315123700.0, 4: nan},
'country_alpha3': {0: 'ESP', 1: 'ESP', 2: nan, 3: 'BEL', 4: nan},
'country_2': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'docdb_family_id': {0: 37137417,
1: 37137417,
2: 37137417,
3: 37137417,
4: 35462722},
'publn_nr_y': {0: 2693.0, 1: 2693.0, 2: 2693.0, 3: 2693.0, 4: 2715.0},
'cpc_class_interest': {0: 'Y02', 1: 'Y02', 2: 'Y02', 3: 'Y02', 4: 'Y02'}}
Specifically, doi is a string variable and pmid is a float.
Now, what I am trying to do is the following (and please feel free to suggest a smarter way to merge the dataframes, since they are very large):
dd_papall_green = dd_papers_all.merge(
dd_green_papers_frontier,
how="inner",
on=["pmid", "doi"]
).persist()
but it fails with the error:
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
Hence, I converted both doi and pmid to float:
dd_papers_all['pmid'] = dd_papers_all['pmid'].astype(float)
dd_papers_all['doi'] = dd_papers_all['doi'].astype(float)
dd_green_papers_frontier['pmid'] = dd_green_papers_frontier['pmid'].astype(float)
dd_green_papers_frontier['doi'] = dd_green_papers_frontier['doi'].astype(float)
but again the merge fails.
How can I perform the described merge?
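Since doi values such as '10.1016/J.AB.2003.07.015' are not numeric, the .astype(float) route can only raise; the usual way out of the object-vs-float64 merge error is to move both keys in both frames to the same string dtype instead. A minimal sketch, assuming the two dask frames as loaded above:
for frame in (dd_papers_all, dd_green_papers_frontier):
    # doi stays a string; a float pmid like 14656521.0 becomes a clean
    # string via the nullable Int64 dtype, which drops the trailing '.0'
    frame['doi'] = frame['doi'].astype(str)
    frame['pmid'] = frame['pmid'].astype('Int64').astype(str)

# Rows with missing keys become the strings 'nan'/'<NA>'; filter them
# out first if they should not participate in the join.
dd_papall_green = dd_papers_all.merge(
    dd_green_papers_frontier,
    how="inner",
    on=["pmid", "doi"],
).persist()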

Related

Merge dataframes in pandas with a combination of keys

I have two dataframes that I need to combine based on a key (an 'incident number'). The key, however, is repeated, because the database they will be ingested by requires a particular format for coordinates. How can I join the necessary columns based on a combination of keys?
For example, the two tables look like:
Incident_Number  Lat/Long  GPSCoordinates
AB123            Lat       32.123
AB123            Long      120.123
CD321            Lat       31.321
CD321            Long      121.321
and...
Incident_Number  Lat/Long  GeoCodeCoordinates
AB123            Lat       35.123
AB123            Long      125.123
CD321            Lat       36.321
CD321            Long      126.321
And I need to get to...
Incident_Number  Lat/Long  GPSCoordinates  GeoCodeCoordinates
AB123            Lat       32.123          35.123
AB123            Long      120.123         125.123
CD321            Lat       31.321          36.321
CD321            Long      121.321         126.321
The two tables do not contain exactly the same records, so the merge needs to allow for NaNs. I am essentially trying to add the 'GeoCodeCoordinates' column to the other dataframe on a combination of 'Incident_Number' and 'Lat/Long', so that 'AB123 + Lat' and 'AB123 + Long' are each treated as a single key. Can this be specified in code, or do I need to create a new column that concatenates the two values into one key?
I imagine I went about this in a bit of a goofy way. The Lat and Long were originally stored in separate fields and I used .melt() to make the data longer. The database that will ultimately take this in requires the longer format for the Lat/Long field.
GPSColList = list(GPSRecords.columns)
GPSColList.remove('Latitude')
GPSColList.remove('Longitude')
GPSMelt = GPSRecords.melt(id_vars=GPSColList, value_vars=['Latitude', 'Longitude'], var_name='Lat/Long', value_name="GPSCoordinates")
As the two sets of coordinates were in separate fields I created two dataframes with each set of coordinates and melted them separately. My attempt to merge them looks like:
mergeMelt = pd.merge(GPSMelt, GeoCodeMelt[["GeoCodeCoordinates"]], on=['Incident_Number', 'Lat/Long'])
Result is KeyError: 'Incident_Number'
Adding samples as requested:
geocodeMelt:
print(geocodeMelt.head(10).to_dict())
{'OID_': {0: 5211, 1: 5212, 2: 5213, 3: 5214, 4: 5215, 5: 5216, 6: 5217, 7: 5218, 8: 5219, 9: 5220}, 'Unit_Level': {0: 'RRU (Riverside
Unit)', 1: 'RRU (Riverside Unit)', 2: 'RRU (Riverside Unit)', 3: 'RRU (Riverside Unit)', 4: 'RRU (Riverside Unit)', 5: 'RRU (Riverside
Unit)', 6: 'RRU (Riverside Unit)', 7: 'RRU (Riverside Unit)', 8: 'RRU (Riverside Unit)', 9: 'RRU (Riverside Unit)'}, 'Agency_FDID': {0: 33090, 1: 33051, 2: 33054, 3: 33054, 4: 33090, 5: 33070, 6: 33030, 7: 33054, 8: 33090, 9: 33052}, 'Incident_Number': {0: '21CARRU0000198', 1: '21CARRU0000564', 2: '21CARRU0000523', 3: '21CARRU0000624', 4: '21CARRU0000436', 5: '21CARRU0000439', 6: '21CARRU0000496', 7: '21CARRU0000422', 8: '21CARRU0000466', 9: '21CARRU0000016'}, 'Exposure': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}, 'CAD_Incident_Type': {0: '71', 1: '67B01O', 2: '71C01', 3: '69D03', 4: '67', 5: '67', 6: '71', 7: '69D06', 8: '71C01', 9: '82B01'}, 'CALFIRS_Incident_Type': {0: 'Passenger vehicle fire', 1: 'Outside rubbish, trash or waste fire', 2: 'Passenger vehicle fire', 3: 'Building fire', 4: 'Outside rubbish, trash or waste fire', 5: 'Outside rubbish, trash or waste fire', 6: 'Passenger vehicle fire', 7: 'Dumpster or other outside trash receptacle fire', 8: 'Passenger vehicle fire', 9: 'Brush or brush-and-grass mixture fire'}, 'Incident_Date': {0: '1/1/2021 0:00:00', 1: '1/1/2021 0:00:00', 2: '1/1/2021 0:00:00', 3: '1/1/2021 0:00:00', 4: '1/1/2021 0:00:00', 5: '1/1/2021 0:00:00', 6: '1/1/2021 0:00:00', 7: '1/1/2021 0:00:00', 8: '1/1/2021 0:00:00', 9: '1/1/2021 0:00:00'}, 'Report_Date_Time': {0: nan, 1: '1/1/2021 20:34:00', 2: '1/1/2021 19:07:00', 3: '1/1/2021 23:33:00', 4: nan, 5: '1/1/2021 16:56:00', 6: '1/1/2021 18:28:00', 7: '1/1/2021 16:16:00', 8: '1/1/2021 17:40:00', 9: '1/1/2021 0:15:00'}, 'Day': {0: '06 - Friday', 1: '06 - Friday', 2: '06 - Friday', 3: '06 - Friday', 4: '06 - Friday', 5: '06 - Friday', 6: '06 - Friday', 7: '06 - Friday', 8: '06 - Friday', 9: '06 - Friday'}, 'Incident_Name': {0: 'HY 91 W/ SERFAS CLUB DR', 1: 'QUAIL PL MENI', 2: 'CAR', 3: 'SUNNY', 4: 'MARTINEZ RD SANJ', 5: 'W METZ RD / ALTURA DR', 6: 'PALM DR / BUENA VISTA AV', 7: 'DELL', 8: 'HY 74 E HEM', 9: 'MADISON ST / AVE 60'}, 'Address': {0: 'HY 91 W Corona CA 92880', 1: '23880 KENNEDY LN Menifee CA 92587', 2: 'THEODORE ST/EUCALYPTUS AV Moreno Valley CA 92555', 3: '24490 SUNNYMEAD Moreno Valley CA 92553', 4: '40300 MARTINEZ San Jacinto CA 92583', 5: '1388 West METZ Perris CA 92570', 6: 'PALM DR/BUENA VISTA AV Desert hot springs CA 92240', 7: '25361 DELPHINIUM Moreno Valley CA 92553', 8: '43763 HY 74 East Hemet CA 92544', 9: 'MADISON ST/AVE 60 La Quinta CA 92253'}, 'Acres_Burned': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: 0.01}, 'Wildland_Fire_Cause': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: 'UU - Undetermined'}, 'Latitude_D': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7:
nan, 8: nan, 9: nan}, 'Longitude_D': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'Member_Making_Report': {0: 'Muhammad Nassar', 1: 'TODD PHILLIPS', 2: 'DAVID COLOMBO', 3: 'GREGORY MOWAT', 4: 'MICHAEL ESPARZA', 5: 'Benjamin Hall', 6: 'TIMOTHY CABRAL', 7: 'JORGE LOMELI', 8: 'JOSHUA BALBOA', 9: 'SETH SHIVELY'}, 'Battalion': {0: 4.0, 1: 13.0, 2: 9.0, 3: 9.0, 4: 5.0, 5: 1.0, 6: 10.0, 7: 9.0, 8: 5.0, 9: 6.0}, 'Incident_Status': {0: 'Submitted', 1: 'Submitted', 2: 'Submitted', 3: 'Submitted', 4: 'Submitted', 5: 'Submitted', 6: 'Submitted', 7: 'Submitted', 8: 'Submitted', 9: 'Submitted'}, 'DDLat': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'DDLon': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'DiscrepancyDistanceFeet': {0: 4178.0, 1: 107.0, 2: 2388.0, 3: 233159.0, 4: 102.0, 5: 1768.0, 6: 1094.0, 7: 78.0, 8: 35603721.0, 9: 149143.0}, 'DiscrepancyDistanceMiles': {0: 1.0, 1: 0.0, 2: 0.0, 3: 44.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 6743.0, 9: 28.0}, 'DiscrepancyGreaterThan1000ft': {0: 1.0, 1: 2.0, 2: 1.0, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: 2.0, 8: 1.0, 9: 1.0}, 'LocationLegitimate': {0: nan, 1: 1.0, 2: nan, 3: nan, 4: 1.0, 5: nan, 6: nan, 7: 1.0, 8: nan, 9: nan}, 'LocationErrorCategory': {0: nan, 1: 7.0, 2: nan, 3: nan, 4: 7.0,
5: nan, 6: nan, 7: 7.0, 8: nan, 9: nan}, 'LocationErrorComment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'LocationErrorResolution': {0: nan, 1: 6.0, 2: nan, 3: nan, 4: 6.0, 5: nan, 6: nan, 7: 6.0, 8: nan, 9: nan}, 'LocationErrorResolutionComment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'CADLatitudeDDM': {0: '33 53.0746416', 1: '33 42.3811205', 2: '33 55.9728055', 3: '33 56.3706594', 4: '33 47.9788195', 5: '33 47.6486387', 6: '33 57.5747994', 7: '33 54.3721212', 8: '33 44.8499992', 9: '33 38.1589793'}, 'CADLongitudeDDM': {0: '-117 38.2368024', 1: '-117 14.5374611', 2: '-117 07.9119009', 3: '-117 14.1319211', 4: '-116 57.4446600', 5: '-117 15.4013420', 6: '-116 30.2784078', 7: '-117 13.2052213', 8: '-116 53.8524596',
9: '-116 15.0473995'}, 'GeocodeSymbology': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2}, 'Lat/Long': {0: 'Latitude', 1: 'Latitude', 2: 'Latitude', 3: 'Latitude', 4: 'Latitude', 5: 'Latitude', 6: 'Latitude', 7: 'Latitude', 8: 'Latitude', 9: 'Latitude'}, 'CAD_Coords': {0: '33 52.924', 1: '33 42.364', 2: '33 56.100', 3: '33 93.991', 4: '33 47.9629', 5: '33 47.390', 6: '33 57.573', 7: '33 54.385', 8: '33 44.859', 9: '33 61.269'}}
and GPSMelt:
print(GPSMelt.head(10).to_dict())
{'OID_': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}, 'Unit_Level': {0: 'RRU (Riverside Unit)', 1: 'RRU (Riverside Unit)', 2: 'RRU (Riverside Unit)', 3: 'RRU (Riverside Unit)', 4: 'RRU (Riverside Unit)', 5: 'RRU (Riverside Unit)', 6: 'RRU (Riverside Unit)', 7: 'RRU (Riverside Unit)', 8: 'RRU (Riverside Unit)', 9: 'RRU (Riverside Unit)'}, 'Agency_FDID': {0: 33090, 1: 33054, 2: 33030, 3: 33051, 4: 33054, 5: 33090, 6: 33070, 7: 33054, 8: 33090, 9: 33035}, 'Incident_Number': {0: '21CARRU0000198', 1: '21CARRU0000523', 2: '21CARRU0000496', 3: '21CARRU0000564', 4: '21CARRU0000624', 5: '21CARRU0000436', 6: '21CARRU0000439', 7: '21CARRU0000422', 8: '21CARRU0000466', 9: '21CARRU0000007'}, 'Exposure': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}, 'CAD_Incident_Type': {0: '71', 1: '71C01', 2: '71', 3: '67B01O', 4: '69D03', 5: '67', 6: '67', 7: '69D06', 8: '71C01', 9: '82C03'}, 'CALFIRS_Incident_Type': {0: 'Passenger vehicle fire', 1: 'Passenger vehicle fire', 2: 'Passenger vehicle fire', 3: 'Outside rubbish, trash or waste fire', 4: 'Building fire', 5: 'Outside rubbish, trash or waste fire', 6: 'Outside rubbish, trash or waste fire', 7: 'Dumpster or other outside trash receptacle fire', 8: 'Passenger vehicle fire', 9: 'Brush or brush-and-grass mixture fire'}, 'Incident_Date': {0: '1/1/2021 0:00:00', 1: '1/1/2021 0:00:00', 2: '1/1/2021 0:00:00', 3: '1/1/2021 0:00:00', 4: '1/1/2021 0:00:00', 5: '1/1/2021 0:00:00', 6: '1/1/2021 0:00:00', 7: '1/1/2021 0:00:00', 8: '1/1/2021 0:00:00', 9: '1/1/2021 0:00:00'}, 'Report_Date_Time': {0: nan, 1: '1/1/2021 19:07:00', 2: '1/1/2021 18:28:00', 3: '1/1/2021 20:34:00', 4: '1/1/2021 23:33:00', 5: nan, 6: '1/1/2021 16:56:00', 7: '1/1/2021 16:16:00', 8: '1/1/2021 17:40:00', 9: '1/1/2021 0:07:00'}, 'Day': {0: '06 - Friday', 1: '06 - Friday', 2: '06 - Friday', 3: '06 - Friday', 4: '06 - Friday', 5: '06 - Friday', 6: '06 - Friday', 7: '06 - Friday', 8: '06 - Friday', 9: '06 - Friday'}, 'Incident_Name': {0: 'HY 91 W/ SERFAS CLUB DR', 1: 'CAR', 2: 'PALM DR / BUENA VISTA AV', 3: 'QUAIL PL MENI', 4: 'SUNNY', 5: 'MARTINEZ RD SANJ', 6: 'W METZ RD / ALTURA DR', 7: 'DELL', 8: 'HY 74 E HEM', 9: 'RIVERSIDE DR / JOY ST'}, 'Address': {0: 'HY 91 W Corona CA 92880', 1: 'THEODORE ST/EUCALYPTUS AV Moreno Valley CA 92555', 2: 'PALM DR/BUENA VISTA AV Desert hot springs CA 92240', 3: '23880 KENNEDY LN Menifee CA 92587', 4: '24490 SUNNYMEAD Moreno Valley CA 92553', 5: '40300 MARTINEZ San Jacinto CA 92583', 6: '1388 West METZ Perris CA 92570', 7: '25361 DELPHINIUM Moreno Valley CA 92553', 8: '43763 HY 74 East Hemet CA 92544', 9: 'RIVERSIDE DR/JOY ST Lake Elsinore CA 92530'}, 'Acres_Burned': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: 1.0}, 'Wildland_Fire_Cause': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: 'Misuse of Fire by a Minor'}, 'Latitude_D': {0: 33.88206666666667, 1: 33.935, 2: 33.95955, 3: 33.706066666666665, 4: 34.566516666666665, 5: 33.79938166666667, 6: 33.789833333333334, 7: 33.906416666666665, 8: 33.74765, 9: 33.679883333333336}, 'Longitude_D': {0: -117.62385, 1: -117.13931666666667, 2: -116.50103333333333, 3: -117.2422, 4: -117.39321666666666, 5: -116.9573, 6: -117.254, 7: -117.22008333333332, 8: 116.89728333333332, 9: -117.37076666666665}, 'Member_Making_Report': {0: 'Muhammad Nassar', 1: 'DAVID COLOMBO', 2: 'TIMOTHY CABRAL', 3: 'TODD PHILLIPS', 4: 'GREGORY MOWAT', 5: 'MICHAEL ESPARZA', 6: 'Benjamin Hall', 7: 'JORGE LOMELI', 8: 'JOSHUA BALBOA', 9: 'KEVIN MERKH'}, 'Battalion': {0: 
4.0, 1: 9.0, 2: 10.0, 3: 13.0, 4: 9.0, 5: 5.0, 6: 1.0, 7: 9.0, 8: 5.0, 9: 2.0}, 'Incident_Status': {0: 'Submitted', 1: 'Submitted', 2: 'Submitted', 3: 'Submitted', 4: 'Submitted', 5: 'Submitted', 6: 'Submitted', 7: 'Submitted', 8: 'Submitted', 9: 'Submitted'}, 'DDLat': {0: '33.88206667N', 1: '33.93500000N', 2: '33.95955000N', 3: '33.70606667N', 4: '34.56651667N', 5: '33.79938167N', 6: '33.78983333N', 7: '33.90641667N', 8: '33.74765000N', 9: '33.67988333N'}, 'DDLon': {0: '117.62385000W', 1: '117.13931667W', 2: '116.50103333W', 3: '117.24220000W', 4: '117.39321667W', 5: '116.95730000W', 6: '117.25400000W', 7: '117.22008333W', 8: '116.89728333E', 9: '117.37076667W'}, 'DiscrepancyDistanceFeet': {0: 4178.0, 1: 2388.0, 2: 1094.0, 3: 107.0, 4: 233159.0, 5: 102.0, 6: 1768.0, 7: 78.0, 8: 35603721.0, 9: 9298.0}, 'DiscrepancyDistanceMiles': {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 44.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 6743.0, 9: 2.0}, 'DiscrepancyGreaterThan1000ft': {0: 1.0, 1: 1.0, 2: 1.0, 3: 2.0, 4: 1.0, 5: 2.0, 6: 1.0, 7: 2.0, 8: 1.0, 9: 1.0}, 'LocationLegitimate': {0: nan, 1: nan, 2: nan, 3: 1.0, 4: nan, 5: 1.0, 6: nan, 7: 1.0, 8: nan, 9: nan}, 'LocationErrorCategory': {0: nan, 1: nan, 2: nan, 3: 7.0, 4: nan, 5: 7.0, 6: nan, 7: 7.0, 8: nan, 9: nan}, 'LocationErrorComment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'LocationErrorResolution': {0: nan, 1: nan, 2: nan, 3: 6.0, 4: nan, 5: 6.0, 6: nan, 7: 6.0, 8: nan, 9: nan}, 'LocationErrorResolutionComment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'CADLatitudeDDM': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'CADLongitudeDDM': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'GeocodeSymbology': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}, 'Lat/Long': {0: 'Latitude', 1: 'Latitude', 2: 'Latitude', 3: 'Latitude', 4: 'Latitude', 5: 'Latitude', 6: 'Latitude', 7: 'Latitude', 8: 'Latitude', 9: 'Latitude'}, 'CALFIRS_Coords': {0: '33 52.924', 1: '33 56.100', 2: '33 57.573', 3: '33 42.364', 4: '33 93.991', 5: '33 47.9629', 6: '33 47.390', 7: '33 54.385', 8: '33 44.859', 9: '33 40.793'}}
Try:
cols = ['Incident_Number', 'Lat/Long', 'GeoCodeCoordinates']
mergeMelt = GPSMelt.merge(GeoCodeMelt[cols], on=cols[:-1])
The KeyError: 'Incident_Number' is raised because you select only GeoCodeMelt[['GeoCodeCoordinates']], so the Incident_Number and Lat/Long columns no longer exist on the right side when you merge.
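For reference, a runnable toy version of that fix, built from the small tables in the question (how='left' is an assumption to keep rows without a match as NaN, per the "records are not equal" requirement; the answer above uses the default inner join):
import pandas as pd

# Toy frames mirroring the tables in the question
GPSMelt = pd.DataFrame({
    'Incident_Number': ['AB123', 'AB123', 'CD321', 'CD321'],
    'Lat/Long': ['Lat', 'Long', 'Lat', 'Long'],
    'GPSCoordinates': [32.123, 120.123, 31.321, 121.321],
})
GeoCodeMelt = pd.DataFrame({
    'Incident_Number': ['AB123', 'AB123', 'CD321', 'CD321'],
    'Lat/Long': ['Lat', 'Long', 'Lat', 'Long'],
    'GeoCodeCoordinates': [35.123, 125.123, 36.321, 126.321],
})

cols = ['Incident_Number', 'Lat/Long', 'GeoCodeCoordinates']
# keep both key columns on the right side so the merge can see them
mergeMelt = GPSMelt.merge(GeoCodeMelt[cols], on=cols[:-1], how='left')
print(mergeMelt)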

python trying to select pandas columns that have spaces or ellipses in them, backticks not working

I'm trying to access a column in a pandas dataframe in python.
Here is the dataframe I am using. I only printed the first row in order to keep it a manageable size. I would have selected a smaller number of columns, but as you will see, that is exactly my problem and the reason I'm asking this question.
{'OilSampleID': {0: 'TGSM3059'}, 'Type': {0: 'Oil Seep'}, 'Country': {0: 'Mexico'}, 'USGS Province': {0: 'Ocean'}, 'State / Province': {0: 'GOM'}, 'County / Block': {0: nan}, 'Block': {0: 'TGS Area 2'}, 'Section': {0: '11'}, 'Well': {0: 'VER115'}, 'Upper Depth (ft)': {0: nan}, 'Lower Depth (ft)': {0: nan}, 'Standardized Reservoir Age': {0: nan}, 'Reservoir Age': {0: nan}, 'Standardized Reservoir Formation': {0: nan}, 'Reservoir Formation': {0: nan}, 'Lithology': {0: nan}, 'Operator': {0: 'TDI-Brooks'}, 'Location': {0: nan}, 'Latitude': {0: 21.712965}, 'Longitude': {0: -96.280545}, 'Datum': {0: 'WGS84 UTM15N'}, 'API / UWI Well #': {0: nan}, 'Comments': {0: 'Piston Core Sediment'}, 'SampleID': {0: nan}, 'ClientID': {0: nan}, 'API Gravity': {0: nan}, '<C15': {0: nan}, '% S': {0: nan}, 'ppm Ni': {0: nan}, 'ppm V': {0: nan}, '% Sat': {0: 10.3448275862249}, '% Aro': {0: 13.7931034482896}, '% NSO': {0: 48.2758620689676}, '% Asph': {0: 27.5862068965179}, 'Sat / Aro': {0: 0.75}, '13Cs': {0: nan}, '13Ca': {0: -27.51}, '13Cwc': {0: nan}, '34Swc': {0: nan}, 'CV': {0: nan}, 'EOM': {0: 193.0}, 'Misc': {0: 'TSF = 150,000'}, 'Misc.1': {0: 150000.0}, 'SRType': {0: 'Marine Carbonate/Marl'}, 'SRTconf': {0: 'suspected'}, 'SRAge': {0: 'UJ/LK'}, 'SRAconf': {0: 'suspected'}, 'SRM': {0: nan}, 'BIOD': {0: 'Mild 4'}, 'BIOD_': {0: 4.0}, 'GM REF': {0: nan}, 'GM Fam': {0: nan}, 'C19/C23': {0: 0.018}, 'C22/C21': {0: 1.176}, 'C24/C23': {0: 0.405}, 'C26/C25': {0: 0.672}, 'Tet/C23': {0: 0.471}, 'C27T/C27': {0: 0.015}, 'C28/H': {0: 0.029}, 'C29/H': {0: 0.988}, 'X/H': {0: 0.013}, 'OL/H': {0: 0.009}, 'C31R/H': {0: 0.408}, 'GA/C31R': {0: 0.06}, 'C35S/C34S': {0: 1.131}, 'S/H': {0: 0.242}, '% C27': {0: 23.975}, '% C28': {0: 28.596}, '% C29': {0: 47.429}, 'S1/S6': {0: 0.887}, 'C29 20S/R': {0: 0.63}, 'C29 bbS/aaR': {0: 1.06}, 'C27 Ts/Tm': {0: 0.382}, 'C29 Ts/Tm': {0: 0.1}, 'DM/H': {0: 0.012}, 'C26(S+R) / Ts': {0: 0.316}, 'P': {0: 8.98723829264705}, '3MP': {0: 4.01903139633122}, '2MP': {0: 5.318803252166}, '9MP': {0: 5.38721229720994}, '1MP': {0: 3.85655991435188}, 'C4N': {0: 0.43610766215509}, 'DBT': {0: 0.0}, '4MDBT': {0: 0.256533918914759}, '32MDBT': {0: nan}, '1MDBT': {0: nan}, 'C20TAS': {0: 1.65036821168495}, 'C21TAS': {0: 2.32590753149381}, 'C26STAS': {0: 12.6898778556501}, 'C26RC27STAS': {0: 62.243679859351}, 'C28STAS': {0: 52.8801918189623}, 'C27RTAS': {0: 42.2254830533693}, 'C28RTAS': {0: 46.860195855096}, '#9b': {0: 8.0}, '#3e': {0: 11.0}, 'C18': {0: 0.0}, 'C19': {0: 1.0}, 'C20': {0: 1.0}, 'MPI': {0: 0.768292682926829}, 'F1': {0: 0.50253106304648}, 'F2': {0: 0.286240220892775}, 'P/DBT': {0: nan}, 'DBT/C4N': {0: nan}, 'MDR': {0: nan}, 'TAS1': {0: 0.0376145000974469}, 'TAS2': {0: 0.0472878998609179}, 'TAS3(CR)': {0: 0.0180023228803717}, 'TAS4': {0: 0.239974126778784}, 'TAS5': {0: 0.901094890510949}, 'DINO3/9': {0: 1.3740932642487}, 'C19T': {0: 0.181692648715433}, 'C20T': {0: 0.622946224167199}, 'C21T': {0: 1.62225579210208}, 'C22T': {0: 1.90777281151205}, 'C23T': {0: 9.92820544766473}, 'C24T': {0: 4.02319436441316}, 'C25S': {0: 2.336048340627}, 'C25R': {0: 2.336048340627}, 'TET': {0: 4.67209668125399}, 'C26S': {0: 1.60927774576526}, 'C26R': {0: 1.53140946774436}, 'ET_C28S': {0: 1.03824370694533}, 'ET_C28R': {0: 1.23291440199758}, 'ET_C29S': {0: 2.01159718220658}, 'ET_C29R': {0: 3.02388479647828}, 'ET_C30S': {0: 1.36269486536575}, 'ET_C30R': {0: 5.84012085156749}, 'ET_C31S': {0: 1.14206807763986}, 'ET_C31R': {0: 0.584012085156749}, 'Ts': {0: 9.92820544766473}, 'C27T': {0: 0.545077946146299}, 'Tm': {0: 26.0209829053174}, 
'C28DM': {0: 2.40093857231108}, 'C28H': {0: 2.79027996241558}, 'C29DM': {0: 1.12909003130305}, 'C29H': {0: 96.2451916338322}, 'C29D': {0: 9.60375428924431}, 'C30X': {0: 1.29780463368166}, 'OL': {0: 0.882507150903532}, 'C30H': {0: 97.4391718968193}, 'C30M': {0: 8.70826909200397}, 'C31S': {0: 56.6751283528783}, 'C31R': {0: 39.7387778833326}, 'GA': {0: 2.38796052597426}, 'C32S': {0: 27.461546048704}, 'C32R': {0: 18.0135283155015}, 'C33S': {0: 20.6610497682121}, 'C33R': {0: 12.6146610393858}, 'C34S': {0: 12.458924483344}, 'C34R': {0: 7.7219375704059}, 'C35S': {0: 14.0941583217829}, 'C35R': {0: 8.53955448962535}, 'S1': {0: 7.30404447836041}, 'S2': {0: 4.12442312584033}, 'S3': {0: 4.85119372070206}, 'S4': {0: 11.337621279843}, 'S4B': {0: 10.4109887713943}, 'S5': {0: 6.32290417529707}, 'S5B': {0: 7.54024492169047}, 'S6': {0: 8.23067698680911}, 'S7': {0: 2.54369708201606}, 'S8': {0: 3.83371488789564}, 'S9': {0: 6.23205785093935}, 'S9B': {0: 8.394200370653}, 'S10': {0: 7.19502888913115}, 'S10B': {0: 8.99378611141393}, 'S11': {0: 3.67019150405175}, 'S12': {0: 7.4675678622043}, 'S13': {0: 14.0085032159599}, 'S13B': {0: 16.0071223518296}, 'S14': {0: 12.4641157018787}, 'S14B': {0: 14.916966459537}, 'ISTD': {0: 500.0}, 'S15': {0: 11.7918529016316}, 'Biodegraded': {0: nan}, '15': {0: nan}, '16': {0: nan}, '17': {0: nan}, 'Pr': {0: nan}, '18': {0: nan}, 'Ph': {0: nan}, '19': {0: nan}, '20': {0: nan}, '21': {0: nan}, '22': {0: nan}, '23': {0: nan}, '24': {0: nan}, '25': {0: nan}, '26': {0: nan}, '27': {0: nan}, '28': {0: nan}, '29': {0: nan}, '30': {0: nan}, '31': {0: nan}, '32': {0: nan}, '33': {0: nan}, '34': {0: nan}, '35': {0: nan}, 'Pr/Ph': {0: nan}, 'Pr/nC17': {0: nan}, 'Ph/nC18': {0: nan}, 'nC27/nC17': {0: nan}, 'nC19*2/(nC18+nC20)': {0: nan}, 'CPI': {0: nan}, 'S1/S6.1': {0: 0.887}, 'DWW D/R': {0: 1.1739236190043156}, 'DWW Ts/(Ts+Tm)': {0: 0.27617328519855566}, 'DWW 35SH/34SH': {0: 1.131}, 'DWW 29H/H': {0: 0.988}, 'DWW OL/H': {0: 0.009}, 'DWW GA/H': {0: 0.02450719232818326}, 'DWW 31H/H': {0: 0.9894778902504008}, 'DWW 29H/31H': {0: 0.9982501009557132}, 'DWW %27 St': {0: 23.975}, 'DWW %28 St': {0: 28.596}, 'DWW %29 St': {0: 47.429}, 'DWW M/H': {0: 0.08937133724027711}, 'DWW St/Ho': {0: 0.242}, 'DWW 29S/R': {0: 0.63}, 'Family': {0: 2}}
I looked at this StackOverflow post to try and find the answer, but I think I'm running into problems because my data has headers with spaces in them.
I followed the other example to try and select just some of the columns. Here is my code to try and select only a few.
geology_data_selection = geology_data[['OilSampleID', 'Type', 'Country', 'USGS Province', 'Well', 'Latitude', 'Longitude', 'EOM', 'Misc...43', 'BIOD', '% Sat', '% Aro', '% NSO', '% Asph']]
I was thinking the problem is that some of these columns, like % Sat, have spaces in them. One of the comments under that SO post told me to use backticks, so I tried using backticks, but that didn't work either.
geology_data_selection = geology_data[['OilSampleID', 'Type', 'Country', '`USGS Province`', 'Well', 'Latitude', 'Longitude', 'EOM', 'Misc...43', 'BIOD', '`% Sat`', '`% Aro`', '`% NSO`', '`% Asph`']]
How can I select columns from a pandas dataframe explicitly when the column names aren't all that great (some of them containing '...' and some containing spaces)?
If you supply keys that exist... this works just fine.
df[['OilSampleID', 'Type', 'Country', 'USGS Province', 'Well', 'Latitude', 'Longitude', 'EOM', 'BIOD', '% Sat', '% Aro', '% NSO', '% Asph']]
Output:
OilSampleID Type Country USGS Province Well Latitude Longitude EOM BIOD % Sat % Aro % NSO % Asph
0 TGSM3059 Oil Seep Mexico Ocean VER115 21.712965 -96.280545 193.0 Mild 4 10.344828 13.793103 48.275862 27.586207
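Two general notes, not specific to this dataset: backticks are an escape for column names inside df.query()/pd.eval() expressions, not for plain bracket selection, where you simply pass the exact header strings. When a header is uncertain, print or search the real names first; a small sketch using a stand-in frame with names taken from the dump above:
import pandas as pd

# Stand-in for the asker's geology_data; real names from the dump above
geology_data = pd.DataFrame({'OilSampleID': ['TGSM3059'], '% Sat': [10.34],
                             'Misc': ['TSF = 150,000'], 'Misc.1': [150000.0]})

# Print the exact header strings pandas actually stored
print(geology_data.columns.tolist())

# ...or search programmatically, e.g. for the real name behind 'Misc...43'
misc_cols = [c for c in geology_data.columns if 'Misc' in c]
print(misc_cols)  # -> ['Misc', 'Misc.1'] in the sample data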

how to aggregate columns based on the value of others

If I had a dataframe such as this, how would I create aggregates such as min, max and mean for each Port for each given year?
df1 = pd.DataFrame({'Year': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},'Port': {0: 'NORTH SHIELDS', 1: 'NORTH SHIELDS', 2: 'NORTH SHIELDS', 3: 'NORTH SHIELDS', 4: 'NORTH SHIELDS'},'Vessel capacity units': {0: 760.5, 1: 760.5, 2: 760.5, 3: 760.5, 4: 760.5},'Engine power': {0: 790.0, 1: 790.0, 2: 790.0, 3: 790.0, 4: 790.0},'Registered tonnage': {0: 516.0, 1: 516.0, 2: 516.0, 3: 516.0, 4: 516.0},'Overall length': {0: 45.0, 1: 45.0, 2: 45.0, 3: 45.0, 4: 45.0},'Value(£)': {0: 2675.81, 1: 62.98, 2: 9.67, 3: 527.02, 4: 2079.0}, 'Landed Weight (tonnes)': {0: 0.978, 1: 0.0135, 2: 0.001, 3: 0.3198, 4: 3.832}})
df1
IIUC
df1.groupby(['Port', 'Year'])['<WHATEVER COLUMN HERE>'].agg(['count', 'min', 'max', 'mean'])  # groups by 'Port' and 'Year' and computes count, min, max, and mean
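Applied to the sample frame, a sketch trimmed to the two numeric measure columns could look like this:
import pandas as pd

# Trimmed version of the df1 constructed in the question
df1 = pd.DataFrame({
    'Year': [2019] * 5,
    'Port': ['NORTH SHIELDS'] * 5,
    'Value(£)': [2675.81, 62.98, 9.67, 527.02, 2079.0],
    'Landed Weight (tonnes)': [0.978, 0.0135, 0.001, 0.3198, 3.832],
})

# min/max/mean of each measure, per Port per Year
agg = df1.groupby(['Port', 'Year'])[['Value(£)', 'Landed Weight (tonnes)']].agg(
    ['min', 'max', 'mean']
)
print(agg)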
Without any kind of background information this question is tricky. Would you want it for every year or just some given years?
Extracting min/max/mean etc. is quite straightforward. I assume that you have some kind of data file and have read a df from it:
file = 'my-data.csv' # the data file
df = pd.read_csv(file)
VALUE_I_WANT_TO_EXTRACT = df['Column name']
Then for each port you can extract the min/max/mean data like this:
for port, group in df.groupby('Port'):
    print(port, group['Column name'].min())
But, as I said, without any specific knowledge about the problem it is hard to provide a solution.

Resolving a "value is trying to be set on a copy of a slice" error

Trying to resolve the error:
application.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
application.py:26: SettingWithCopyWarning:
but I can't figure out why I'm getting this warning or how to resolve it.
This is my code:
hr = hr_data[['Month','SalesSystemCode','TITULO','BirthDate','HireDate','SupervisorEmployeeID','BASE','carallowance','Commission_Target','Area','Fulfilment %','Commission Accrued','Commission paid',
'Características (D)', 'Características (I)', 'Características (S)','Características (C)', 'Motivación (D)', 'Motivación (I)','Motivación (S)', 'Motivación (C)', 'Bajo Stress (D)',
'Bajo Stress (I)', 'Bajo Stress (S)', 'Bajo Stress (C)']]
sales = sales_data[['Report month', 'Area','Customer','Rental Charge','Cod. Motivo Desconexion','ID Vendedor']]
#report month to datetime
sales['Report month'] = pd.to_datetime(sales['Report month'])
hr['Month'] = pd.to_datetime(hr['Month'])
#remove sales where customer churned
sales_clean = sales.loc[sales['Cod. Motivo Desconexion'] == 0]
sales_clean = sales_clean[['Report month','Rental Charge','ID Vendedor']]
sales_clean2 = pd.DataFrame(sales_clean.groupby(['Report month','ID Vendedor'])['Rental Charge'].sum())
sales_clean2.reset_index(inplace=True)
hr_area = hr.loc[hr['Area'] == 'Area 1']
merged_hr = hr_area.merge(sales_clean, left_on=['SalesSystemCode','Month'],right_on=['ID Vendedor','Report month'],how='left')
#creating new features: months of employment
merged_hr['MonthsofEmploymentRounded'] = round((merged_hr['Month'] - merged_hr['HireDate'])/np.timedelta64(1,'M')).astype('int')
#filters for interaction
YEAR_MONTH = merged_hr['Month'].unique()
#css stylesheet
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
#html layout
app.layout = html.Div(children=[
    html.H1(children='SAC Challenge Level 2 Dashboard', style={
        'textAlign': 'center',
        'height': '10'
    }),
    html.Div(children='''
        Objective: Studying the impact of supervision on the performance of sales executives in Area 1
    '''),
    dcc.DatePickerRange(
        id='year_month',
        start_date=min(merged_hr['Month'].dt.date.tolist()),
        end_date='Select date'
    ),
    dcc.Graph(
        id='performancetable'
    )
])
@app.callback(dash.dependencies.Output('performancetable', 'figure'),
              [dash.dependencies.Input('year_month', 'start_date'),
               dash.dependencies.Input('year_month', 'end_date')])
def update_table(year_month):
    if year_month is None or year_month == []:
        year_month = YEAR_MONTH
    performance = merged_hr[(merged_hr['Month'].isin(year_month))]
    return {
        'data': [
            go.Table(
                header=dict(values=list(performance.columns), fill_color='paleturquoise', align='left'),
                cells=dict(values=[performance['Month'], performance['SalesSystemCode'], performance['TITULO'],
                                   performance['HireDate'], performance['MonthsofEmploymentRounded'], performance['SupervisorEmployeeID'],
                                   performance['BASE'], performance['carallowance'], performance['Commission_Target'],
                                   performance['Fulfilment %'], performance['Commission Accrued'], performance['Commission paid'],
                                   performance['Características (D)'], performance['Características (I)'], performance['Características (S)'],
                                   performance['Características (C)'], performance['Motivación (D)'], performance['Motivación (I)'],
                                   performance['Motivación (S)'], performance['Motivación (C)'], performance['Bajo Stress (D)'],
                                   performance['Bajo Stress (I)'], performance['Bajo Stress (S)'], performance['Bajo Stress (C)'],
                                   performance['Rental Charge']])
            )],
    }

if __name__ == '__main__':
    app.run_server(debug=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here is a sample of hr_data:
{'Month': {0: Timestamp('2017-12-01 00:00:00'),
1: Timestamp('2017-12-01 00:00:00'),
2: Timestamp('2017-12-01 00:00:00'),
3: Timestamp('2017-12-01 00:00:00'),
4: Timestamp('2017-12-01 00:00:00')},
'EmployeeID': {0: 91868, 1: 1812496, 2: 1812430, 3: 700915, 4: 1812581},
'PayrollProviderName': {0: 'Tele',
1: 'People',
2: 'People',
3: 'Stratego',
4: 'People'},
'SalesSystemCode': {0: 91868.0,
1: 802496.0,
2: 2430.0,
3: 700915.0,
4: 802581.0},
'Payroll Type': {0: 'Insourcing',
1: 'Third Party',
2: 'Third Party',
3: 'Third Party',
4: 'Third Party'},
'Name': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'TITULO': {0: 'SALES SUPERVISOR',
1: 'SALES EXECUTIVE',
2: 'SALES EXECUTIVE',
3: 'SALES EXECUTIVE',
4: 'SALES EXECUTIVE'},
'Sexo': {0: 'M', 1: 'F', 2: 'F', 3: 'M', 4: 'F'},
'BirthDate': {0: Timestamp('1982-11-05 00:00:00'),
1: Timestamp('1987-09-24 00:00:00'),
2: Timestamp('1981-01-13 00:00:00'),
3: Timestamp('1986-04-18 00:00:00'),
4: Timestamp('1991-06-24 00:00:00')},
'HireDate': {0: Timestamp('2012-04-23 00:00:00'),
1: Timestamp('2017-04-10 00:00:00'),
2: Timestamp('2017-03-13 00:00:00'),
3: Timestamp('2015-01-22 00:00:00'),
4: Timestamp('2017-05-18 00:00:00')},
'SupervisorEmployeeID': {0: 7935, 1: 91868, 2: 91868, 3: 91868, 4: 91868},
'SupervisorName': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'BASE': {0: 895, 1: 700, 2: 700, 3: 700, 4: 700},
'carallowance': {0: 350, 1: 250, 2: 250, 3: 250, 4: 250},
'Commission_Target': {0: 708.33, 1: 583.33, 2: 583.33, 3: 583.33, 4: 583.33},
'Nacionalidad': {0: 'INT', 1: 'INT', 2: 'INT', 3: 'INT', 4: 'INT'},
'Area': {0: 'Area 1', 1: 'Area 1', 2: 'Area 1', 3: 'Area 1', 4: 'Area 1'},
'Comment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Sales Quota (points)': {0: 1810.0, 1: 108.0, 2: 108.0, 3: 108.0, 4: 108.0},
'Real (points)': {0: 1855.0, 1: 86.0, 2: 245.0, 3: 149.0, 4: 91.0},
'Fulfilment %': {0: 1.0248618784530388,
1: 0.7962962962962963,
2: 2.2685185185185186,
3: 1.3796296296296295,
4: 0.8425925925925926},
'Commission Accrued': {0: 708.33, 1: 583.33, 2: 583.33, 3: 583.33, 4: 583.33},
'OA Commission Accrued': {0: 653.66,
1: 87.5,
2: 1494.79,
3: 794.79,
4: 160.42},
'Clawback': {0: 0.0, 1: 24.33, 2: 144.9, 3: 36.77, 4: 0.0},
'Other Commissions': {0: 0.0, 1: 0.0, 2: 9.16, 3: 9.16, 4: 0.0},
'Commission paid': {0: 1361.99, 1: 646.51, 2: 1942.38, 3: 1350.52, 4: 743.75},
'Exit Date': {0: NaT,
1: Timestamp('2018-04-13 00:00:00'),
2: NaT,
3: NaT,
4: Timestamp('2018-08-31 00:00:00')},
'Legal Motive': {0: nan,
1: 'Artículo No. 212',
2: nan,
3: nan,
4: 'Artículo No. 212'},
'Características (D)': {0: nan, 1: 70.0, 2: 70.0, 3: 60.0, 4: 67.0},
'Características (I)': {0: nan, 1: 95.0, 2: 62.0, 3: 25.0, 4: 15.0},
'Características (S)': {0: nan, 1: 20.0, 2: 48.0, 3: 75.0, 4: 40.0},
'Características (C)': {0: nan, 1: 25.0, 2: 34.0, 3: 85.0, 4: 94.0},
'Motivación (D)': {0: nan, 1: 85.0, 2: 75.0, 3: 40.0, 4: 59.0},
'Motivación (I)': {0: nan, 1: 95.0, 2: 74.0, 3: 74.0, 4: 25.0},
'Motivación (S)': {0: nan, 1: 11.0, 2: 58.0, 3: 65.0, 4: 65.0},
'Motivación (C)': {0: nan, 1: 7.0, 2: 33.0, 3: 84.0, 4: 93.0},
'Bajo Stress (D)': {0: nan, 1: 60.0, 2: 69.0, 3: 79.0, 4: 79.0},
'Bajo Stress (I)': {0: nan, 1: 86.0, 2: 60.0, 3: 6.0, 4: 18.0},
'Bajo Stress (S)': {0: nan, 1: 40.0, 2: 60.0, 3: 89.0, 4: 30.0},
'Bajo Stress (C)': {0: nan, 1: 60.0, 2: 48.0, 3: 84.0, 4: 92.0}}
sales_data:
{'Month': {0: Timestamp('2017-07-01 00:00:00'),
1: Timestamp('2017-07-01 00:00:00'),
2: Timestamp('2017-07-01 00:00:00'),
3: Timestamp('2017-07-01 00:00:00'),
4: Timestamp('2017-07-01 00:00:00')},
'Report month': {0: '2017-07',
1: '2017-07',
2: '2017-07',
3: '2017-07',
4: '2017-07'},
'Area': {0: 'Area 1', 1: 'Area 1', 2: 'Area 1', 3: 'Area 1', 4: 'Area 1'},
'Fecha de solicitud': {0: Timestamp('2017-07-25 14:49:51'),
1: Timestamp('2017-07-25 14:56:14'),
2: Timestamp('2017-06-30 13:07:10'),
3: Timestamp('2017-07-03 18:25:17'),
4: Timestamp('2017-07-04 09:56:24')},
'Fecha de salida': {0: Timestamp('2017-07-27 13:11:42'),
1: Timestamp('2017-07-27 15:08:39'),
2: Timestamp('2017-07-04 11:50:07'),
3: Timestamp('2017-07-07 16:40:44'),
4: Timestamp('2017-07-14 14:52:45')},
'Fecha de salida final': {0: Timestamp('2017-07-28 15:13:53'),
1: Timestamp('2017-07-27 15:46:16'),
2: Timestamp('2017-07-05 10:24:46'),
3: Timestamp('2017-07-08 08:36:43'),
4: Timestamp('2017-07-15 10:00:02')},
'Fecha de proceso': {0: Timestamp('2017-08-01 00:00:00'),
1: Timestamp('2017-08-01 00:00:00'),
2: Timestamp('2017-08-01 00:00:00'),
3: Timestamp('2017-08-01 00:00:00'),
4: Timestamp('2017-08-01 00:00:00')},
'Fecha de sistema': {0: Timestamp('2017-07-25 14:49:51'),
1: Timestamp('2017-07-25 14:56:14'),
2: Timestamp('2017-06-30 13:07:10'),
3: Timestamp('2017-07-03 18:25:17'),
4: Timestamp('2017-07-04 09:56:24')},
'Fecha de completada': {0: Timestamp('2017-07-28 15:13:52'),
1: Timestamp('2017-07-27 15:46:15'),
2: Timestamp('2017-07-05 10:24:45'),
3: Timestamp('2017-07-08 08:36:42'),
4: Timestamp('2017-07-15 10:00:02')},
'Fecha de creada': {0: Timestamp('2017-07-25 14:50:00'),
1: Timestamp('2017-07-25 14:56:00'),
2: Timestamp('2017-06-30 13:07:00'),
3: Timestamp('2017-07-03 18:25:00'),
4: Timestamp('2017-07-04 09:56:00')},
'Cod. de Distribucion': {0: 2302, 1: 2302, 2: 2302, 3: 91818, 4: 2302},
'Customer': {0: 19308378, 1: 19308378, 2: 27504455, 3: 27104497, 4: 17608676},
'Cod. Tipo Cliente': {0: 'R', 1: 'R', 2: 'R', 3: 'R', 4: 'R'},
'Tipo De Cliente': {0: 'Residencial ',
1: 'Residencial ',
2: 'Residencial ',
3: 'Residencial ',
4: 'Residencial '},
'Cuenta': {0: 193083780000,
1: 193083780000,
2: 275044550000,
3: 271044970000,
4: 176086760000},
'Status Cuenta': {0: 'W', 1: 'W', 2: 'W', 3: 'W', 4: 'W'},
'Tipo de Contabilidad': {0: 'RP', 1: 'RP', 2: 'RP', 3: 'RP', 4: 'RP'},
'Desc. Tipo Contabilidad': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Tos Cat': {0: 'K', 1: 'K', 2: 'K', 3: 'K', 4: 'K'},
'Desc. Tos Cat': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Mktg Cat': {0: 990005.0, 1: 990005.0, 2: 990000.0, 3: 990000.0, 4: 990000.0},
'Desc. Mktg Cat': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Cod. Bill Sort': {0: 571.0, 1: 571.0, 2: 571.0, 3: 691.0, 4: 256.0},
'Orden de Servicio': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Comando': {0: 'PMO', 1: 'PFB', 2: 'PMO', 3: 'PMO', 4: 'PMO'},
'Desc. Comando': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Prioridad': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5},
'Cod. Línea': {0: 3, 1: 2, 2: 1, 3: 1, 4: 1},
'Número de Servicio': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Producto': {0: 1420, 1: 31000, 2: 1403, 3: 1404, 4: 1404},
'Desc. Producto': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Familia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Sub Familia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Rental Charge': {0: 22.5,
1: 18.7125,
2: 15.257499999999999,
3: 19.95,
4: 19.95},
'Inst Charge': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'Control': {0: 'CONEXIONES_COMPLETADAS_CT',
1: 'CONEXIONES_COMPLETADAS_CT',
2: 'CONEXIONES_COMPLETADAS',
3: 'CONEXIONES_COMPLETADAS',
4: 'CONEXIONES_COMPLETADAS'},
'Cod. Estatus': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A'},
'Status': {0: 'Por Acción ',
1: 'Por Acción ',
2: 'Por Acción ',
3: 'Por Acción ',
4: 'Por Acción '},
'Cod Razon Pendiente': {0: ' ', 1: ' ', 2: ' ', 3: ' ', 4: ' '},
'Razon Pendiente': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Cod. Motivo Desconexion': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Motivo Desconexion': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Cod. Agencia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Agencia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'ID Vendedor': {0: 2352.0, 1: 2352.0, 2: 2352.0, 3: 2352.0, 4: 2352.0},
'ID Oficinista': {0: 229113.0,
1: 229113.0,
2: 224666.0,
3: 221532.0,
4: 224666.0},
'ID Acct Manager': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'Desc. Acct Manager': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Provincia': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'},
'Central': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Chrg Prod Ant': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Tipo Srv': {0: 'MO', 1: 'TI', 2: 'MO', 3: 'MO', 4: 'MO'},
'Tipo Srv Desc': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Diferencia ': {0: 2.5500000000000007,
1: 0.0,
2: 15.257499999999999,
3: 19.95,
4: 19.95},
'Puntos ': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
@QuanHoang was pointing in the right direction with his comment, but you need to add .copy() for both the hr and sales dataframes:
hr = hr_data[['Month','SalesSystemCode','TITULO','BirthDate','HireDate','SupervisorEmployeeID','BASE','carallowance','Commission_Target','Area','Fulfilment %','Commission Accrued','Commission paid',
'Características (D)', 'Características (I)', 'Características (S)','Características (C)', 'Motivación (D)', 'Motivación (I)','Motivación (S)', 'Motivación (C)', 'Bajo Stress (D)',
'Bajo Stress (I)', 'Bajo Stress (S)', 'Bajo Stress (C)']].copy()
sales = sales_data[['Report month', 'Area','Customer','Rental Charge','Cod. Motivo Desconexion','ID Vendedor']].copy()
Using .copy() works because it creates a full copy of the data, rather than a view. Subsequent indexing operations work correctly on the copy.
Another option is to use .loc[] indexing when you do the selection from hr_data and sales_data. This should also work:
hr = hr_data.loc[:, ['Month','SalesSystemCode','TITULO','BirthDate','HireDate','SupervisorEmployeeID','BASE','carallowance','Commission_Target','Area','Fulfilment %','Commission Accrued','Commission paid',
'Características (D)', 'Características (I)', 'Características (S)','Características (C)', 'Motivación (D)', 'Motivación (I)','Motivación (S)', 'Motivación (C)', 'Bajo Stress (D)',
'Bajo Stress (I)', 'Bajo Stress (S)', 'Bajo Stress (C)']]
sales = sales_data.loc[:, ['Report month', 'Area','Customer','Rental Charge','Cod. Motivo Desconexion','ID Vendedor']]
Note that selecting columns with .loc[] uses the form df.loc[:, [*columns*]] because .loc[] requires specifying the rows explicitly.
Using .loc[] works because .loc[] (and .iloc[]) perform the selection as a single, explicit indexing operation, so pandas does not subject later assignments on the result to the 'setting with copy' bookkeeping that chained selections trigger.
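For anyone who wants to see the warning in isolation, a minimal sketch of the failing and fixed patterns (pandas 1.x/2.x behavior; pandas 3 replaces this machinery with copy-on-write):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

sub = df[['a', 'b']]         # plain [] selection: pandas tracks sub as a
sub['a'] = 0                 # possible copy -> SettingWithCopyWarning

sub = df[['a', 'b']].copy()  # explicit, independent copy
sub['a'] = 0                 # no warning; df is untouched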

Pandas DataFrame: selecting columns does not work

I have a DataFrame with tweets. I want to select only two specific columns ('id' and 'text'). However, I keep having problems with the 'id' column. This works:
import pandas as pd
tweets = pd.read_csv('alltweets.csv')
specified_tweets = tweets[['text']]
But this gives an error:
specified_tweets = tweets[['id','text']]
KeyError: "['id'] not in index"
But 'id' is definitely in the index:
tweets.columns
Index(['id', 'time', 'created_at', 'from_user_name', 'text', 'filter_level',
'possibly_sensitive', 'withheld_copyright', 'withheld_scope',
'truncated', 'retweet_count', 'favorite_count', 'lang', 'to_user_name',
'in_reply_to_status_id', 'quoted_status_id', 'source', 'location',
'lat', 'lng', 'from_user_id', 'from_user_realname',
'from_user_verified', 'from_user_description', 'from_user_url',
'from_user_profile_image_url', 'from_user_utcoffset',
'from_user_timezone', 'from_user_lang', 'from_user_tweetcount',
'from_user_followercount', 'from_user_friendcount',
'from_user_favourites_count', 'from_user_listed',
'from_user_withheld_scope', 'from_user_created_at'],
dtype='object')
EDIT:
This is what the data looks like:
{'created_at': {0: '2018-02-13 13:14:08', 2: '2018-02-13 13:14:23'},
'favorite_count': {0: 0, 2: 0},
'filter_level': {0: 'low', 2: 'low'},
'from_user_created_at': {0: '2011-07-28 13:56:37', 2: '2017-10-14 13:21:03'},
'from_user_description': {0: "Feyenoord..... en me lieverd natuurlijk! Anti islam, lasciate ogne speranza voi ch'entrate", 2: "The world has 2 mayor problems: SSocialism and isLam. Without those ideologies we didn't have wars, mass migration and terrorism. http://Gab.ai/Diver"},
'from_user_favourites_count': {0: 3630, 2: 0},
'from_user_followercount': {0: 594, 2: 479},
'from_user_friendcount': {0: 592, 2: 524},
'from_user_id': {0: 344062208, 2: 919191322162024448},
'from_user_lang': {0: 'nl', 2: 'nl'},
'from_user_listed': {0: 129, 2: 1},
'from_user_name': {0: 'Ratatouile1', 2: 'DuikerT3'},
'from_user_realname': {0: 'Rat', 2: 'Gab.ai/Diver'},
'from_user_timezone': {0: 'Amsterdam', 2: 'Hanoi'},
'from_user_tweetcount': {0: 14077, 2: 17775},
'from_user_url': {0: nan, 2: 'url'},
'from_user_utcoffset': {0: 3600.0, 2: 25200.0},
'from_user_verified': {0: 0, 2: 0},
'from_user_withheld_scope': {0: nan, 2: nan},
'in_reply_to_status_id': {0: nan, 2: nan},
'lang': {0: 'nl', 2: 'nl'},
'lat': {0: nan, 2: nan},
'lng': {0: nan, 2: nan},
'location': {0: 'Schiedam', 2: 'Thailand'},
'possibly_sensitive': {0: nan, 2: nan},
'quoted_status_id': {0: nan, 2: nan},
'retweet_count': {0: 0, 2: 0},
'source': {0: 'Twitter for Android', 2: 'Twitter Web Client'},
'text': {0: 'RT #Derksen_Gelul: Dus #Zijlstra loog en toen hij betrapt werd kwam hij weer met een nieuwe leugen, dit kan zelfs #Rutte niet meer rec… ', 2: 'RT #MikevdGalienNL: Nee, #Rutte, de inhoud van het verhaal van #Zijlstra staat niet. Het is 100% duidelijk dat hij de boel helemaal bij… '},
'time': {0: 1518527648, 2: 1518527663},
'to_user_name': {0: nan, 2: nan},
'truncated': {0: nan, 2: nan},
'withheld_copyright': {0: nan, 2: nan},
'withheld_scope': {0: nan, 2: nan},
'\ufeffid': {0: 963400901305217024, 2: 963400963934642178}}
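Note the last key in the dump: the column is '\ufeffid', not 'id'. The CSV starts with a UTF-8 byte-order mark that was absorbed into the first header when the file was read. A sketch of two common fixes (file name as in the question):
import pandas as pd

# Option 1: re-read the file with the BOM-aware codec
tweets = pd.read_csv('alltweets.csv', encoding='utf-8-sig')

# Option 2: strip the BOM from the already-loaded headers
tweets.columns = tweets.columns.str.replace('\ufeff', '')

specified_tweets = tweets[['id', 'text']]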
