I am trying to inner-merge two Dask DataFrames on two IDs, namely doi and pmid.
The datasets look like this (only the head; feel free to modify the doi and pmid values to construct an MWE):
dd_papers_all:
{'cited_by_count': {0: nan, 1: nan, 2: 9.0, 3: nan, 4: 30.0},
'cited_by_url': {0: 'None',
1: 'None',
2: "['W1968224982', 'W1977435724', 'W2003814720', 'W2006453929', 'W2015063028', 'W2139614344']",
3: 'None',
4: "['W181218938', 'W1969520123', 'W1970043627', 'W1977525191', 'W2006834484', 'W2057850214', 'W2062554850', 'W2070252209', 'W2098074569', 'W2123616561', 'W2150154625', 'W2408116868']"},
'authors': {0: "['Walczak M', 'Pawlaczyk J']",
1: "['Ioan Oliver Avram']",
2: "['T.M. Dmitrieva', 'T.P. Eremeeva', 'G.I. Alatortseva', 'Vadim I. Agol']",
3: "['Djurdjina Ružić', 'Tatjana Vujović', 'Gabriela Libiaková', 'Radosav Cerović', 'Alena Gajdošová']",
4: "['M. Harris']"},
'institutions': {0: '[]',
1: '[]',
2: "['Lomonosov Moscow State University', 'USSR Academy of Medical Sciences', 'Lomonosov Moscow State University', 'USSR Academy of Medical Sciences']",
3: "['Fruit Research Institute', 'Fruit Research Institute', 'Institute of Plant Genetics and Biotechnology', 'Fruit Research Institute', 'Institute of Plant Genetics and Biotechnology']",
4: "['Durban University of Technology']"},
'paper_id': {0: 'W155261221',
1: 'W145424619',
2: 'W1482328891',
3: 'W1581876373',
4: 'W1978891149'},
'pub_year_col': {0: 1969, 1: 2010, 2: 1980, 3: 2012, 4: 2008},
'level_0': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'level_1': {0: "['Ophthalmology', 'Pediatrics', 'Internal medicine', 'Endocrinology', 'Anatomy', 'Developmental psychology']",
1: "['Risk analysis (engineering)', 'Manufacturing engineering', 'Industrial engineering', 'Operations research', 'Mechanical engineering', 'Epistemology', 'Climatology', 'Macroeconomics', 'Operating system']",
2: "['Virology', 'Cell biology', 'Biochemistry', 'Quantum mechanics']",
3: "['Horticulture', 'Botany', 'Biochemistry']",
4: "['Pedagogy', 'Mathematics education', 'Paleontology', 'Social science', 'Neuroscience', 'Programming language', 'Quantum mechanics']"},
'level_2': {0: "['Craniopharyngioma', 'Fundus (uterus)', 'Girl', 'Ventricle', 'Fourth ventricle']",
1: "['Process (computing)', 'Production (economics)', 'Machine tool', 'Quality (philosophy)', 'Productivity', 'Multiple-criteria decision analysis', 'Machining', 'Forcing (mathematics)']",
2: "['Mechanism (biology)', 'Replication (statistics)', 'Virus', 'Gene']",
3: "['Vaccinium', 'In vitro']",
4: "['Negotiation', 'Reflective practice', 'Construct (python library)', 'Action research', 'Context (archaeology)', 'Reflective writing', 'Perception', 'Power (physics)']"},
'doi': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'pmid': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'mag': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
and dd_green_papers_frontier:
{'paperid': {0: 2006817976,
1: 2006817976,
2: 1972698438,
3: 1968223008,
4: 2149313415},
'uspto': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'doi': {0: '10.1016/J.AB.2003.07.015',
1: '10.1016/J.AB.2003.07.015',
2: '10.1007/S002170100404',
3: '10.1007/S002170100336',
4: '10.3324/%X'},
'pmid': {0: 14656521.0, 1: 14656521.0, 2: nan, 3: nan, 4: 12414351.0},
'publn_nr_x': {0: 2693, 1: 2693, 2: 2693, 3: 2693, 4: 2715},
'paperyear': {0: 2003.0, 1: 2003.0, 2: 2001.0, 3: 2001.0, 4: 2002.0},
'papertitle': {0: 'development of melting temperature based sybr green i polymerase chain reaction methods for multiplex genetically modified organism detection',
1: 'development of melting temperature based sybr green i polymerase chain reaction methods for multiplex genetically modified organism detection',
2: 'event specific detection of roundup ready soya using two different real time pcr detection chemistries',
3: 'characterisation of the roundup ready soybean insert',
4: 'hepatitis c virus infection in a hematology ward evidence for nosocomial transmission and impact on hematologic disease outcome'},
'magfieldid': {0: 149629.0,
1: 149629.0,
2: 143660080.0,
3: 40767140.0,
4: 2780572000.0},
'oecd_field': {0: '2. Engineering and Technology',
1: '2. Engineering and Technology',
2: '1. Natural Sciences',
3: '1. Natural Sciences',
4: '3. Medical and Health Sciences'},
'oecd_subfield': {0: '2.11 Other engineering and technologies',
1: '2.11 Other engineering and technologies',
2: '1.06 Biological sciences',
3: '1.06 Biological sciences',
4: '3.02 Clinical medicine'},
'wosfield': {0: 'Food Science & Technology',
1: 'Food Science & Technology',
2: 'Biochemical Research Methods',
3: 'Biochemistry & Molecular Biology',
4: 'Hematology'},
'author': {0: 2083283000.0, 1: 2808753700.0, 2: nan, 3: 2315123700.0, 4: nan},
'country_alpha3': {0: 'ESP', 1: 'ESP', 2: nan, 3: 'BEL', 4: nan},
'country_2': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'docdb_family_id': {0: 37137417,
1: 37137417,
2: 37137417,
3: 37137417,
4: 35462722},
'publn_nr_y': {0: 2693.0, 1: 2693.0, 2: 2693.0, 3: 2693.0, 4: 2715.0},
'cpc_class_interest': {0: 'Y02', 1: 'Y02', 2: 'Y02', 3: 'Y02', 4: 'Y02'}}
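To turn the two dicts above into a runnable MWE, something like the following sketch should work (d_papers_all and d_green_frontier are hypothetical names for those dicts, not part of my actual code):
import numpy as np
import pandas as pd
import dask.dataframe as dd

nan = np.nan  # define this before pasting the dict literals above

# d_papers_all / d_green_frontier are hypothetical names for the two dicts shown above
dd_papers_all = dd.from_pandas(pd.DataFrame(d_papers_all), npartitions=1)
dd_green_papers_frontier = dd.from_pandas(pd.DataFrame(d_green_frontier), npartitions=1)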
Specifically, doi is a string variable and pmid is a float.
Now, what I am trying to do (please feel free to suggest a smarter way to merge the DataFrames, since they are very large) is the following:
dd_papall_green = dd_papers_all.merge(
    dd_green_papers_frontier,
    how="inner",
    on=["pmid", "doi"]
).persist()
but it fails with the error:
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
Hence, what I did was to convert both doi and pmid to float:
dd_papers_all['pmid'] = dd_papers_all['pmid'].astype(float)
dd_papers_all['doi'] = dd_papers_all['doi'].astype(float)
dd_green_papers_frontier['pmid'] = dd_green_papers_frontier['pmid'].astype(float)
dd_green_papers_frontier['doi'] = dd_green_papers_frontier['doi'].astype(float)
but again the merge fails.
How can I perform the described merge?
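For reference, here is a minimal sketch of the alternative I am considering (untested on the full data, and assuming a pandas/Dask version that supports the nullable 'string' dtype): cast both keys to strings on both frames, since doi values such as '10.1016/J.AB.2003.07.015' can never be parsed as floats, and only then merge:
for frame in (dd_papers_all, dd_green_papers_frontier):
    # NaN becomes <NA> under the 'string' dtype, so missing keys never match in an inner merge;
    # note that float pmid values stringify with a trailing '.0', so both sides must end up
    # with the same representation
    frame['doi'] = frame['doi'].astype('string')
    frame['pmid'] = frame['pmid'].astype('string')

dd_papall_green = dd_papers_all.merge(
    dd_green_papers_frontier,
    how="inner",
    on=["pmid", "doi"]
).persist()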
I am trying to concat or append (neither is working) two 9-column DataFrames. But instead of just stacking them vertically, pandas keeps trying to add 9 more empty columns as well. Do you know how to stop this?
The output looks like this:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,0,1,10,11,12,13,2,3,4,5,6,7,8,9
10/23/2020,New Castle,DE,Gary,IN,Full,Flatbed,0.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency ,,,,,,,,,,,,,,
10/22/2020,Wilmington,DE,METHUEN,MA,Full,Flatbed / Step Deck,0.00,48,48,0,Ken,(903) 280-7878,UrTruckBroker ,,,,,,,,,,,,,,
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,47,1,0,Dispatch,(912) 748-3801,DSV Road Inc. ,,,,,,,,,,,,,,
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,48,1,0,Dispatch,(541) 826-4786,Sureway Transportation Co / Anderson Trucking Serv ,,,,,,,,,,,,,,
10/30/2020,New Castle,DE,Gary,IN,Full,Flatbed,945.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency ,,,,,,,,,,,,,,
...
,,,,,,,,,,,,,,03/02/2021,Knapp,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Jackson,NE,Full,Flatbed / Step Deck,0.0,48.0,48.0
,,,,,,,,,,,,,,03/02/2021,Knapp,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Sterling,IL,Full,Flatbed / Step Deck,0.0,48.0,48.0
,,,,,,,,,,,,,,03/02/2021,Milwaukee,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Great Falls,MT,Full,Flatbed / Step Deck,0.0,45.0,48.0
,,,,,,,,,,,,,,03/02/2021,Algoma,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Pamplico,SC,Full,Flatbed / Step Deck,0.0,48.0,48.0
The code makes a web request to get data, which I save to a DataFrame; that DataFrame is then concatenated with another DataFrame that comes from a CSV, and I then save all of this back to that CSV:
import json

import pandas as pd
import requests

this_csv = 'freights_trulos.csv'
try:
    old_df = pd.read_csv(this_csv)
except BaseException as e:
    print(e)
    old_df = pd.DataFrame()

state, equip = 'DE', 'Flat'
url = "https://backend-a.trulos.com/load-table/grab_loads.php?state=%s&equipment=%s" % (state, equip)
payload = {}
headers = {
    ...
}
response = requests.request("GET", url, headers=headers, data=payload)
# print(response.text)
parsed = json.loads(response.content)
# keep the first 13 fields and pull the company name out of the HTML in the fourth-to-last field
data = [r[0:13] + [r[-4].split('<br/>')[-2].split('>')[-1]] for r in parsed]
df = pd.DataFrame(data=data)
if not old_df.empty:
    # concatenate old and new and remove duplicates
    # df.reset_index(drop=True, inplace=True)
    # old_df.reset_index(drop=True, inplace=True)
    # df = pd.concat([old_df, df], ignore_index=True)  # <--- CONCAT HAS SAME ISSUES AS APPEND
    df = df.append(old_df, ignore_index=True)
# remove duplicates on cols
df = df.drop_duplicates()
df.to_csv(this_csv, index=False)
EDIT: the DataFrames being appended have had their dtypes changed:
df.dtypes
Out[2]:
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
10 object
11 object
12 object
13 object
dtype: object
old_df.dtypes
Out[3]:
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 float64
8 int64
9 int64
10 int64
11 object
12 object
13 object
dtype: object
old_df written to CSV:
0,1,2,3,4,5,6,7,8,9,10,11,12,13
10/23/2020,New Castle,DE,Gary,IN,Full,Flatbed,0.0,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
10/22/2020,Wilmington,DE,METHUEN,MA,Full,Flatbed / Step Deck,0.0,48,48,0,Ken,(903) 280-7878,UrTruckBroker
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.0,47,1,0,Dispatch,(912) 748-3801,DSV Road Inc.
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.0,48,1,0,Dispatch,(541) 826-4786,Sureway Transportation Co / Anderson Trucking Serv
10/30/2020,New Castle,DE,Gary,IN,Full,Flatbed,945.0,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
new_df written to CSV:
0,1,2,3,4,5,6,7,8,9,10,11,12,13
10/23/2020,New Castle,DE,Gary,IN,Full,Flatbed,0.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
10/22/2020,Wilmington,DE,METHUEN,MA,Full,Flatbed / Step Deck,0.00,48,48,0,Ken,(903) 280-7878,UrTruckBroker
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,47,1,0,Dispatch,(912) 748-3801,DSV Road Inc.
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,48,1,0,Dispatch,(541) 826-4786,Sureway Transportation Co / Anderson Trucking Serv
10/30/2020,New Castle,DE,Gary,IN,Full,Flatbed,945.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
I guess the problem could be how you read the data: if I copy your sample data to Excel, split by comma, and then import it to pandas, all is fine. But if I split on commas AND whitespace, I get 9 additional columns. So you could try debugging by replacing all whitespace before creating your DataFrame.
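A rough sketch of that debugging step (assuming df and old_df are the frames from your code) could be:
# replace spaces inside the string cells before concatenating (a sketch)
df = df.apply(lambda col: col.str.replace(' ', '_') if col.dtype == object else col)

# concat/append only stacks vertically when both frames carry identical column labels,
# so printing them for comparison is also worth a try
print(df.columns.tolist())
print(old_df.columns.tolist())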
I also used your sample data and it worked just fine for me when I initialized it like this:
import pandas as pd
df_new = pd.DataFrame({'0': {0: '10/23/2020',
1: '10/22/2020',
2: '10/23/2020',
3: '10/23/2020',
4: '10/30/2020'},
'1': {0: 'New_Castle',
1: 'Wilmington',
2: 'WILMINGTON',
3: 'WILMINGTON',
4: 'New_Castle'},
'2': {0: 'DE', 1: 'DE', 2: 'DE', 3: 'DE', 4: 'DE'},
'3': {0: 'Gary', 1: 'METHUEN', 2: 'METHUEN', 3: 'METHUEN', 4: 'Gary'},
'4': {0: 'IN', 1: 'MA', 2: 'MA', 3: 'MA', 4: 'IN'},
'5': {0: 'Full', 1: 'Full', 2: 'Full', 3: 'Full', 4: 'Full'},
'6': {0: 'Flatbed',
1: 'Flatbed_/_Step_Deck',
2: 'Flatbed_w/Tarps',
3: 'Flatbed_w/Tarps',
4: 'Flatbed'},
'7': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 945.0},
'8': {0: 46, 1: 48, 2: 47, 3: 48, 4: 46},
'9': {0: 48, 1: 48, 2: 1, 3: 1, 4: 48},
'10': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'11': {0: 'Dispatch', 1: 'Ken', 2: 'Dispatch', 3: 'Dispatch', 4: 'Dispatch'},
'12': {0: '(800)_488-1860',
1: '(903)_280-7878',
2: '(912)_748-3801',
3: '(541)_826-4786',
4: '(800)_488-1860'},
'13': {0: 'Meadow_Lark_Agency_',
1: 'UrTruckBroker_',
2: 'DSV_Road_Inc._',
3: 'Sureway_Transportation_Co_/_Anderson_Trucking_Serv_',
4: 'Meadow_Lark_Agency_'}})
df_old = pd.DataFrame({'0': {0: '10/23/2020',
1: '10/22/2020',
2: '10/23/2020',
3: '10/23/2020',
4: '10/30/2020'},
'1': {0: 'New_Castle',
1: 'Wilmington',
2: 'WILMINGTON',
3: 'WILMINGTON',
4: 'New_Castle'},
'2': {0: 'DE', 1: 'DE', 2: 'DE', 3: 'DE', 4: 'DE'},
'3': {0: 'Gary', 1: 'METHUEN', 2: 'METHUEN', 3: 'METHUEN', 4: 'Gary'},
'4': {0: 'IN', 1: 'MA', 2: 'MA', 3: 'MA', 4: 'IN'},
'5': {0: 'Full', 1: 'Full', 2: 'Full', 3: 'Full', 4: 'Full'},
'6': {0: 'Flatbed',
1: 'Flatbed_/_Step_Deck',
2: 'Flatbed_w/Tarps',
3: 'Flatbed_w/Tarps',
4: 'Flatbed'},
'7': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 945.0},
'8': {0: 46, 1: 48, 2: 47, 3: 48, 4: 46},
'9': {0: 48, 1: 48, 2: 1, 3: 1, 4: 48},
'10': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'11': {0: 'Dispatch', 1: 'Ken', 2: 'Dispatch', 3: 'Dispatch', 4: 'Dispatch'},
'12': {0: '(800)_488-1860',
1: '(903)_280-7878',
2: '(912)_748-3801',
3: '(541)_826-4786',
4: '(800)_488-1860'},
'13': {0: 'Meadow_Lark_Agency_',
1: 'UrTruckBroker_',
2: 'DSV_Road_Inc._',
3: 'Sureway_Transportation_Co_/_Anderson_Trucking_Serv_',
4: 'Meadow_Lark_Agency_'}})
df_new.append(df_old, ignore_index=True)
#OR
pd.concat([df_new, df_old])
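Either call stacks the two frames into ten rows; a quick sanity check (a sketch) that no extra columns appear:
result = pd.concat([df_new, df_old], ignore_index=True)
print(result.shape)  # (10, 14): rows stacked vertically, no extra columns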
I am working with movie data and have a dataframe column for movie genre. Currently the column contains a list of movie genres for each movie (as most movies are assigned to multiple genres), but for the purpose of this analysis, I would like to parse the list and create a new dataframe column for each genre. So instead of having genre=['Drama','Thriller'] for a given movie, I would have two columns, something like genre1='Drama' and genre2='Thriller'.
Here is a snippet of my data:
{'color': {0: [u'Color::(Technicolor)'],
1: [u'Color::(Technicolor)'],
2: [u'Color::(Technicolor)'],
3: [u'Color::(Technicolor)'],
4: [u'Black and White']},
'country': {0: [u'USA'],
1: [u'USA'],
2: [u'USA'],
3: [u'USA', u'UK'],
4: [u'USA']},
'genre': {0: [u'Crime', u'Drama'],
1: [u'Crime', u'Drama'],
2: [u'Crime', u'Drama'],
3: [u'Action', u'Crime', u'Drama', u'Thriller'],
4: [u'Crime', u'Drama']},
'language': {0: [u'English'],
1: [u'English', u'Italian', u'Latin'],
2: [u'English', u'Italian', u'Spanish', u'Latin', u'Sicilian'],
3: [u'English', u'Mandarin'],
4: [u'English']},
'rating': {0: 9.3, 1: 9.2, 2: 9.0, 3: 9.0, 4: 8.9},
'runtime': {0: [u'142'],
1: [u'175'],
2: [u'202', u'220::(The Godfather Trilogy 1901-1980 VHS Special Edition)'],
3: [u'152'],
4: [u'96']},
'title': {0: u'The Shawshank Redemption',
1: u'The Godfather',
2: u'The Godfather: Part II',
3: u'The Dark Knight',
4: u'12 Angry Men'},
'votes': {0: 1793199, 1: 1224249, 2: 842044, 3: 1774083, 4: 484061},
'year': {0: 1994, 1: 1972, 2: 1974, 3: 2008, 4: 1957}}
Any help would be greatly appreciated! Thanks!
I think you need the DataFrame constructor with add_prefix, and finally concat the result back to the original:
df1 = pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')
df = pd.concat([df.drop('genre',axis=1), df1], axis=1)
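On the sample data this yields columns genre_0 through genre_3, padded with None where a movie has fewer than four genres; a quick way to inspect them (a sketch):
print(df.filter(like='genre_').head())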
Timings:
df = pd.DataFrame(d)  # d is the dict from the question
print(df)

# scale up to 5000 rows for the timings
df = pd.concat([df] * 1000).reset_index(drop=True)
In [394]: %timeit (pd.concat([df.drop('genre',axis=1), pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')], axis=1))
100 loops, best of 3: 3.4 ms per loop
In [395]: %timeit (pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1))
1 loop, best of 3: 757 ms per loop
This should work for you:
pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1)
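If you prefer this approach but do not want to hard-code the genre_0 through genre_3 names, the rename can be replaced with add_prefix, as in the first answer (a sketch; note that the timings above suggest apply(pd.Series) is much slower on large frames):
pd.concat([df.drop(['genre'], axis=1),
           df['genre'].apply(pd.Series).add_prefix('genre_')], axis=1)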