Merging Bokeh USCounties sample data with column values from my own dataframe - python

I'm using Python 3 and currently playing with the latest version of Bokeh.
I've imported everything necessary, but I'm a little bit stuck with a single (I hope) line of code.
I'm using the US counties sample data. I want the map to show each county's vote percentage in a tooltip as it's hovered over with the cursor.
I've searched here for other Bokeh examples, specifically ones using the US counties data, but I can only seem to find questions about issues with map shape.
from bokeh.models import LogColorMapper
from bokeh.palettes import Viridis6 as palette
from bokeh.plotting import figure, show
from bokeh.sampledata.us_counties import data as counties

palette = tuple(reversed(palette))
color_mapper = LogColorMapper(palette=palette)

counties = {
    code: county for code, county in counties.items() if county['state'] == 'tx'
}

county_xs = [county['lons'] for county in counties.values()]
county_ys = [county['lats'] for county in counties.values()]
county_names = [county['name'] for county in counties.values()]

## Below is the variable I wish to create, and these are the columns and dataframe of importance.
#county_vote_total =
#texasJbFinal['County Vote Percentage'] - where the vote percentages are
#texasJbFinal['County'] - what my own df's county column is labelled as

data = dict(
    x=county_xs,
    y=county_ys,
    name=county_names,
    voteP=county_vote_total,
)

TOOLS = "pan,wheel_zoom,reset,hover,save"

p = figure(
    title='Joe Biden Texas Vote Percentage',
    tools=TOOLS,
    x_axis_location=None, y_axis_location=None,
    tooltips=[
        ("Name", "@name"), ("Vote Percentage", "@voteP"), ("Long, lat", "($x, $y)")
    ]
)
p.grid.grid_line_color = None
p.hover.point_policy = "follow_mouse"

p.patches("x", "y", source=data,
          fill_color={"field": "voteP", "transform": color_mapper},
          fill_alpha=0.6, line_color="black", line_width=0.5)

show(p)
I have tried a few things but I can't seem to figure out how to match up each individual county from my texasJbFinal dataframe with the bokeh.sampledata.us_counties and then display the vote percentage as each is hovered over.
Here is a sample of my DF, using texasJbFinal.head(5).to_dict()
{'State': {0: 'Texas', 1: 'Texas', 2: 'Texas', 3: 'Texas', 4: 'Texas'},
'County': {0: 'Roberts County',
1: 'Borden County',
2: 'King County',
3: 'Glasscock County',
4: 'Armstrong County'},
'Candidate': {0: 'Joe Biden',
1: 'Joe Biden',
2: 'Joe Biden',
3: 'Joe Biden',
4: 'Joe Biden'},
'Total Votes': {0: 17, 1: 16, 2: 8, 3: 39, 4: 75},
'County Vote Percentage': {0: 3.091, 1: 3.846, 2: 5.031, 3: 5.972, 4: 6.745},
'Total Population': {0: 912, 1: 697, 2: 315, 3: 2171, 4: 2122},
'White Alone': {0: 782, 1: 598, 2: 234, 3: 1003, 4: 1833},
'White Alone Percent': {0: 85.74561403508771,
1: 85.79626972740316,
2: 74.28571428571428,
3: 46.19990787655458,
4: 86.38077285579642},
'Black or African American Alone': {0: 0, 1: 0, 2: 0, 3: 0, 4: 5},
'Black or African American Alone Percent': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.23562676720075398},
'American Indian and Alaska Native Alone': {0: 0, 1: 0, 2: 0, 3: 0, 4: 22},
'Asian Alone': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Native Hawaiian and Other Pacific Islander Alone': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0},
'Some other race Alone': {0: 0, 1: 0, 2: 3, 3: 507, 4: 42},
'Two or more races': {0: 23, 1: 15, 2: 0, 3: 0, 4: 71},
'Hispanic or Latino Alone': {0: 107, 1: 84, 2: 78, 3: 661, 4: 149},
'Hispanic or Latino Alone Percent': {0: 11.732456140350877,
1: 12.051649928263988,
2: 24.76190476190476,
3: 30.446798710271764,
4: 7.021677662582469}}

Here's how I'd tackle it:
Turn the Bokeh counties data into a DataFrame to merge with your existing df. Something like:
bokeh_counties = pd.DataFrame.from_records([county for key, county in counties.items()])
...and then you'd have to do some regex matching or other text manipulation to merge, since your values are all appended with " County" and those in the Bokeh dataset are not.
Once you've got the merged DataFrame with all the data you need, convert it to a ColumnDataSource for use by the Bokeh glyphs and the HoverTool. While a ColumnDataSource isn't strictly required for a lot of Bokeh tasks, it tends to make things much easier.
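A rough, untested sketch of that whole route, assuming the only mismatch is the trailing " County" suffix in your texasJbFinal['County'] values (column names below are taken from your sample df):
import pandas as pd
from bokeh.models import ColumnDataSource

# Bokeh's (already Texas-filtered) counties dict -> one row per county
bokeh_counties = pd.DataFrame.from_records([county for key, county in counties.items()])

# drop the " County" suffix so the names line up with Bokeh's 'name' column
texasJbFinal['name'] = texasJbFinal['County'].str.replace(' County', '', regex=False)

# attach the vote data to the county geometry
merged = bokeh_counties.merge(texasJbFinal, on='name', how='left')

source = ColumnDataSource(dict(
    x=merged['lons'],
    y=merged['lats'],
    name=merged['name'],
    voteP=merged['County Vote Percentage'],
))

# then: p.patches("x", "y", source=source, ...)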

Thanks for the help. I didn't quite go your route, but it gave me inspiration to solve my issue.
I turned the counties dictionary into a dataframe, did a little text manipulation, merged it with my original pandas dataframe, turned it all back into one dictionary, and everything became very simple after that.
Thanks again for the great answer :)

Related

Multiindex merge in dask

I am trying to inner merge two dask dataframes based on two ids, namely doi and pmid.
The datasets look like this (only the head, feel free to modify the doi and pmid to construct a MWE):
dd_papers_all:
{'cited_by_count': {0: nan, 1: nan, 2: 9.0, 3: nan, 4: 30.0},
'cited_by_url': {0: 'None',
1: 'None',
2: "['W1968224982', 'W1977435724', 'W2003814720', 'W2006453929', 'W2015063028', 'W2139614344']",
3: 'None',
4: "['W181218938', 'W1969520123', 'W1970043627', 'W1977525191', 'W2006834484', 'W2057850214', 'W2062554850', 'W2070252209', 'W2098074569', 'W2123616561', 'W2150154625', 'W2408116868']"},
'authors': {0: "['Walczak M', 'Pawlaczyk J']",
1: "['Ioan Oliver Avram']",
2: "['T.M. Dmitrieva', 'T.P. Eremeeva', 'G.I. Alatortseva', 'Vadim I. Agol']",
3: "['Djurdjina Ružić', 'Tatjana Vujović', 'Gabriela Libiaková', 'Radosav Cerović', 'Alena Gajdošová']",
4: "['M. Harris']"},
'institutions': {0: '[]',
1: '[]',
2: "['Lomonosov Moscow State University', 'USSR Academy of Medical Sciences', 'Lomonosov Moscow State University', 'USSR Academy of Medical Sciences']",
3: "['Fruit Research Institute', 'Fruit Research Institute', 'Institute of Plant Genetics and Biotechnology', 'Fruit Research Institute', 'Institute of Plant Genetics and Biotechnology']",
4: "['Durban University of Technology']"},
'paper_id': {0: 'W155261221',
1: 'W145424619',
2: 'W1482328891',
3: 'W1581876373',
4: 'W1978891149'},
'pub_year_col': {0: 1969, 1: 2010, 2: 1980, 3: 2012, 4: 2008},
'level_0': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'level_1': {0: "['Ophthalmology', 'Pediatrics', 'Internal medicine', 'Endocrinology', 'Anatomy', 'Developmental psychology']",
1: "['Risk analysis (engineering)', 'Manufacturing engineering', 'Industrial engineering', 'Operations research', 'Mechanical engineering', 'Epistemology', 'Climatology', 'Macroeconomics', 'Operating system']",
2: "['Virology', 'Cell biology', 'Biochemistry', 'Quantum mechanics']",
3: "['Horticulture', 'Botany', 'Biochemistry']",
4: "['Pedagogy', 'Mathematics education', 'Paleontology', 'Social science', 'Neuroscience', 'Programming language', 'Quantum mechanics']"},
'level_2': {0: "['Craniopharyngioma', 'Fundus (uterus)', 'Girl', 'Ventricle', 'Fourth ventricle']",
1: "['Process (computing)', 'Production (economics)', 'Machine tool', 'Quality (philosophy)', 'Productivity', 'Multiple-criteria decision analysis', 'Machining', 'Forcing (mathematics)']",
2: "['Mechanism (biology)', 'Replication (statistics)', 'Virus', 'Gene']",
3: "['Vaccinium', 'In vitro']",
4: "['Negotiation', 'Reflective practice', 'Construct (python library)', 'Action research', 'Context (archaeology)', 'Reflective writing', 'Perception', 'Power (physics)']"},
'doi': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'pmid': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'mag': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
and dd_green_papers_frontier:
{'paperid': {0: 2006817976,
1: 2006817976,
2: 1972698438,
3: 1968223008,
4: 2149313415},
'uspto': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'doi': {0: '10.1016/J.AB.2003.07.015',
1: '10.1016/J.AB.2003.07.015',
2: '10.1007/S002170100404',
3: '10.1007/S002170100336',
4: '10.3324/%X'},
'pmid': {0: 14656521.0, 1: 14656521.0, 2: nan, 3: nan, 4: 12414351.0},
'publn_nr_x': {0: 2693, 1: 2693, 2: 2693, 3: 2693, 4: 2715},
'paperyear': {0: 2003.0, 1: 2003.0, 2: 2001.0, 3: 2001.0, 4: 2002.0},
'papertitle': {0: 'development of melting temperature based sybr green i polymerase chain reaction methods for multiplex genetically modified organism detection',
1: 'development of melting temperature based sybr green i polymerase chain reaction methods for multiplex genetically modified organism detection',
2: 'event specific detection of roundup ready soya using two different real time pcr detection chemistries',
3: 'characterisation of the roundup ready soybean insert',
4: 'hepatitis c virus infection in a hematology ward evidence for nosocomial transmission and impact on hematologic disease outcome'},
'magfieldid': {0: 149629.0,
1: 149629.0,
2: 143660080.0,
3: 40767140.0,
4: 2780572000.0},
'oecd_field': {0: '2. Engineering and Technology',
1: '2. Engineering and Technology',
2: '1. Natural Sciences',
3: '1. Natural Sciences',
4: '3. Medical and Health Sciences'},
'oecd_subfield': {0: '2.11 Other engineering and technologies',
1: '2.11 Other engineering and technologies',
2: '1.06 Biological sciences',
3: '1.06 Biological sciences',
4: '3.02 Clinical medicine'},
'wosfield': {0: 'Food Science & Technology',
1: 'Food Science & Technology',
2: 'Biochemical Research Methods',
3: 'Biochemistry & Molecular Biology',
4: 'Hematology'},
'author': {0: 2083283000.0, 1: 2808753700.0, 2: nan, 3: 2315123700.0, 4: nan},
'country_alpha3': {0: 'ESP', 1: 'ESP', 2: nan, 3: 'BEL', 4: nan},
'country_2': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'docdb_family_id': {0: 37137417,
1: 37137417,
2: 37137417,
3: 37137417,
4: 35462722},
'publn_nr_y': {0: 2693.0, 1: 2693.0, 2: 2693.0, 3: 2693.0, 4: 2715.0},
'cpc_class_interest': {0: 'Y02', 1: 'Y02', 2: 'Y02', 3: 'Y02', 4: 'Y02'}}
Specifically, doi is a string variable and pmid is a float.
Now, what I am trying to do (please feel free to suggest a smarter way to merge them, since they are very large) is the following:
dd_papall_green = dd_papers_all.merge(
    dd_green_papers_frontier,
    how="inner",
    on=["pmid", "doi"]
).persist()
but it fails with the error:
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
Hence, what I did was convert both doi and pmid to float:
dd_papers_all['pmid'] = dd_papers_all['pmid'].astype(float)
dd_papers_all['doi'] = dd_papers_all['doi'].astype(float)
dd_green_papers_frontier['pmid'] = dd_green_papers_frontier['pmid'].astype(float)
dd_green_papers_frontier['doi'] = dd_green_papers_frontier['doi'].astype(float)
but again the merge fails.
How can I perform the described merge?
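For what it's worth, a hedged sketch of one direction that often works here (not verified against these exact frames): doi values like '10.1016/...' can never be cast to float, so instead make both key columns the same dtype, e.g. strings, in both frames before merging, and drop rows with missing keys first so stringified NaNs don't spuriously match each other:
# drop rows with missing join keys first, so stringified NaNs can't match
dd_papers_all = dd_papers_all.dropna(subset=["doi", "pmid"])
dd_green_papers_frontier = dd_green_papers_frontier.dropna(subset=["doi", "pmid"])

# cast both join keys to a common dtype (string) in both dask dataframes
for col in ["doi", "pmid"]:
    dd_papers_all[col] = dd_papers_all[col].astype(str)
    dd_green_papers_frontier[col] = dd_green_papers_frontier[col].astype(str)

dd_papall_green = dd_papers_all.merge(
    dd_green_papers_frontier,
    how="inner",
    on=["pmid", "doi"]
).persist()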

Custom function to replace missing values in dataframe with median located in pivot table

I am attempting to write a function to replace missing values in the 'total_income' column with the median 'total_income' provided by the pivot table, using the row's 'education' and 'income_type' to index the pivot table. I want to populate using these medians so that the values are as optimal as they can be. Here is what I am testing:
This is the first 5 rows of the dataframe as a dictionary:
{'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'children': {0: 1, 1: 1, 2: 0, 3: 3, 4: 0},
'days_employed': {0: 8437.673027760233,
1: 4024.803753850451,
2: 5623.422610230956,
3: 4124.747206540018,
4: 340266.07204682194},
'dob_years': {0: 42, 1: 36, 2: 33, 3: 32, 4: 53},
'education': {0: "bachelor's degree",
1: 'secondary education',
2: 'secondary education',
3: 'secondary education',
4: 'secondary education'},
'education_id': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1},
'family_status': {0: 'married',
1: 'married',
2: 'married',
3: 'married',
4: 'civil partnership'},
'family_status_id': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1},
'gender': {0: 'F', 1: 'F', 2: 'M', 3: 'M', 4: 'F'},
'income_type': {0: 'employee',
1: 'employee',
2: 'employee',
3: 'employee',
4: 'retiree'},
'debt': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'total_income': {0: 40620.102,
1: 17932.802,
2: 23341.752,
3: 42820.568,
4: 25378.572},
'purpose': {0: 'purchase of the house',
1: 'car purchase',
2: 'purchase of the house',
3: 'supplementary education',
4: 'to have a wedding'},
'age_group': {0: 'adult',
1: 'adult',
2: 'adult',
3: 'adult',
4: 'older adult'}}
def fill_income(row):
    total_income = row['total_income']
    age_group = row['age_group']
    income_type = row['income_type']
    education = row['education']
    table = df.pivot_table(index=['age_group', 'income_type'], columns='education', values='total_income', aggfunc='median')
    if total_income == 'NaN':
        if age_group == 'adult':
            return table.loc[education, income_type]
My desired output is the pivot table value (the median total_income) for the dataframe row's given education and income_type. When I test it, it returns 'None'.
Thanks in advance for your time helping me with this problem!
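Not a full answer, but a minimal sketch of how this lookup is commonly written (assuming the pivot table above): NaN is a float, so test it with pd.isna rather than comparing to the string 'NaN', build the table once outside the function, and index the rows with the (age_group, income_type) pair and the columns with education:
import pandas as pd

# build the lookup table once, not on every row
table = df.pivot_table(index=['age_group', 'income_type'], columns='education',
                       values='total_income', aggfunc='median')

def fill_income(row):
    if pd.isna(row['total_income']):  # NaN is a float, so == 'NaN' is never True
        # rows are an (age_group, income_type) MultiIndex, columns are education levels
        return table.loc[(row['age_group'], row['income_type']), row['education']]
    return row['total_income']

df['total_income'] = df.apply(fill_income, axis=1)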

how to aggregate columns based on the value of others

If I had a dataframe such as this, how would I create aggregates such as min, max, and mean for each Port for each given year?
df1 = pd.DataFrame({
    'Year': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
    'Port': {0: 'NORTH SHIELDS', 1: 'NORTH SHIELDS', 2: 'NORTH SHIELDS', 3: 'NORTH SHIELDS', 4: 'NORTH SHIELDS'},
    'Vessel capacity units': {0: 760.5, 1: 760.5, 2: 760.5, 3: 760.5, 4: 760.5},
    'Engine power': {0: 790.0, 1: 790.0, 2: 790.0, 3: 790.0, 4: 790.0},
    'Registered tonnage': {0: 516.0, 1: 516.0, 2: 516.0, 3: 516.0, 4: 516.0},
    'Overall length': {0: 45.0, 1: 45.0, 2: 45.0, 3: 45.0, 4: 45.0},
    'Value(£)': {0: 2675.81, 1: 62.98, 2: 9.67, 3: 527.02, 4: 2079.0},
    'Landed Weight (tonnes)': {0: 0.978, 1: 0.0135, 2: 0.001, 3: 0.3198, 4: 3.832}
})
df1
IIUC
df1.groupby(['Port', 'Year'])['<WHATEVER COLUMN HERE>'].agg(['count', 'min', 'max', 'mean'])  # groups by 'Port' and 'Year' and computes count, min, max, and mean
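For example, applied to the df1 above (using two of its numeric columns), a concrete call could look like:
df1.groupby(['Port', 'Year'])[['Value(£)', 'Landed Weight (tonnes)']].agg(['count', 'min', 'max', 'mean'])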
Without any kind of background information this question is tricky. Would you want it for every year or just some given years?
Extracting min/max/mean etc. is quite straightforward. I assume that you have some kind of data file and have read a df from there:
file = 'my-data.csv'  # the data file
df = pd.read_csv(file)
VALUE_I_WANT_TO_EXTRACT = df['Column name']
Then for each port you can extract the min/max/mean data like this:
for port, group in df.groupby('Port'):
    print(port, group['Column name'].min())
But, as I said, without any kind of specific knowledge about the problem it is hard to provide a solution.

I can't get pandas to union my dataframes properly

I'm trying to concat or append (neither is working) two 9-column dataframes. But instead of just stacking them vertically, pandas keeps trying to add 9 more empty columns as well. Do you know how to stop this?
output looks like this:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,0,1,10,11,12,13,2,3,4,5,6,7,8,9
10/23/2020,New Castle,DE,Gary,IN,Full,Flatbed,0.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency ,,,,,,,,,,,,,,
10/22/2020,Wilmington,DE,METHUEN,MA,Full,Flatbed / Step Deck,0.00,48,48,0,Ken,(903) 280-7878,UrTruckBroker ,,,,,,,,,,,,,,
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,47,1,0,Dispatch,(912) 748-3801,DSV Road Inc. ,,,,,,,,,,,,,,
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,48,1,0,Dispatch,(541) 826-4786,Sureway Transportation Co / Anderson Trucking Serv ,,,,,,,,,,,,,,
10/30/2020,New Castle,DE,Gary,IN,Full,Flatbed,945.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency ,,,,,,,,,,,,,,
...
,,,,,,,,,,,,,,03/02/2021,Knapp,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Jackson,NE,Full,Flatbed / Step Deck,0.0,48.0,48.0
,,,,,,,,,,,,,,03/02/2021,Knapp,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Sterling,IL,Full,Flatbed / Step Deck,0.0,48.0,48.0
,,,,,,,,,,,,,,03/02/2021,Milwaukee,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Great Falls,MT,Full,Flatbed / Step Deck,0.0,45.0,48.0
,,,,,,,,,,,,,,03/02/2021,Algoma,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Pamplico,SC,Full,Flatbed / Step Deck,0.0,48.0,48.0
The code makes a web request to get data, which I save to a dataframe; that is then concat-ed with another dataframe that comes from a CSV. I then save all of this back to that CSV:
this_csv = 'freights_trulos.csv'

try:
    old_df = pd.read_csv(this_csv)
except BaseException as e:
    print(e)
    old_df = pd.DataFrame()

state, equip = 'DE', 'Flat'
url = "https://backend-a.trulos.com/load-table/grab_loads.php?state=%s&equipment=%s" % (state, equip)

payload = {}
headers = {
    ...
}

response = requests.request("GET", url, headers=headers, data=payload)
# print(response.text)
parsed = json.loads(response.content)
data = [r[0:13] + [r[-4].split('<br/>')[-2].split('>')[-1]] for r in parsed]

df = pd.DataFrame(data=data)

if not old_df.empty:
    # concatenate old and new and remove duplicates
    # df.reset_index(drop=True, inplace=True)
    # old_df.reset_index(drop=True, inplace=True)
    # df = pd.concat([old_df, df], ignore_index=True) <--- CONCAT HAS SAME ISSUES AS APPEND
    df = df.append(old_df, ignore_index=True)

# remove duplicates on cols
df.drop_duplicates()
df.to_csv(this_csv, index=False)
EDIT: the appended dfs have had their types changed
df.dtypes
Out[2]:
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
10 object
11 object
12 object
13 object
dtype: object
old_df.dtypes
Out[3]:
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 float64
8 int64
9 int64
10 int64
11 object
12 object
13 object
dtype: object
old_df to csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13
10/23/2020,New Castle,DE,Gary,IN,Full,Flatbed,0.0,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
10/22/2020,Wilmington,DE,METHUEN,MA,Full,Flatbed / Step Deck,0.0,48,48,0,Ken,(903) 280-7878,UrTruckBroker
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.0,47,1,0,Dispatch,(912) 748-3801,DSV Road Inc.
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.0,48,1,0,Dispatch,(541) 826-4786,Sureway Transportation Co / Anderson Trucking Serv
10/30/2020,New Castle,DE,Gary,IN,Full,Flatbed,945.0,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
new_df to csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13
10/23/2020,New Castle,DE,Gary,IN,Full,Flatbed,0.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
10/22/2020,Wilmington,DE,METHUEN,MA,Full,Flatbed / Step Deck,0.00,48,48,0,Ken,(903) 280-7878,UrTruckBroker
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,47,1,0,Dispatch,(912) 748-3801,DSV Road Inc.
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,48,1,0,Dispatch,(541) 826-4786,Sureway Transportation Co / Anderson Trucking Serv
10/30/2020,New Castle,DE,Gary,IN,Full,Flatbed,945.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
I guess the problem could be how you read the data. If I copy your sample data to Excel, split by comma, and then import it to pandas, all is fine. But if I split on comma AND whitespace, I get 9 additional columns. So you could try debugging by replacing all whitespace before creating your dataframe.
I also used your sample data and it worked just fine for me when I initialize it like this:
import pandas as pd
df_new = pd.DataFrame({'0': {0: '10/23/2020',
1: '10/22/2020',
2: '10/23/2020',
3: '10/23/2020',
4: '10/30/2020'},
'1': {0: 'New_Castle',
1: 'Wilmington',
2: 'WILMINGTON',
3: 'WILMINGTON',
4: 'New_Castle'},
'2': {0: 'DE', 1: 'DE', 2: 'DE', 3: 'DE', 4: 'DE'},
'3': {0: 'Gary', 1: 'METHUEN', 2: 'METHUEN', 3: 'METHUEN', 4: 'Gary'},
'4': {0: 'IN', 1: 'MA', 2: 'MA', 3: 'MA', 4: 'IN'},
'5': {0: 'Full', 1: 'Full', 2: 'Full', 3: 'Full', 4: 'Full'},
'6': {0: 'Flatbed',
1: 'Flatbed_/_Step_Deck',
2: 'Flatbed_w/Tarps',
3: 'Flatbed_w/Tarps',
4: 'Flatbed'},
'7': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 945.0},
'8': {0: 46, 1: 48, 2: 47, 3: 48, 4: 46},
'9': {0: 48, 1: 48, 2: 1, 3: 1, 4: 48},
'10': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'11': {0: 'Dispatch', 1: 'Ken', 2: 'Dispatch', 3: 'Dispatch', 4: 'Dispatch'},
'12': {0: '(800)_488-1860',
1: '(903)_280-7878',
2: '(912)_748-3801',
3: '(541)_826-4786',
4: '(800)_488-1860'},
'13': {0: 'Meadow_Lark_Agency_',
1: 'UrTruckBroker_',
2: 'DSV_Road_Inc._',
3: 'Sureway_Transportation_Co_/_Anderson_Trucking_Serv_',
4: 'Meadow_Lark_Agency_'}})
df_old = pd.DataFrame({'0': {0: '10/23/2020',
1: '10/22/2020',
2: '10/23/2020',
3: '10/23/2020',
4: '10/30/2020'},
'1': {0: 'New_Castle',
1: 'Wilmington',
2: 'WILMINGTON',
3: 'WILMINGTON',
4: 'New_Castle'},
'2': {0: 'DE', 1: 'DE', 2: 'DE', 3: 'DE', 4: 'DE'},
'3': {0: 'Gary', 1: 'METHUEN', 2: 'METHUEN', 3: 'METHUEN', 4: 'Gary'},
'4': {0: 'IN', 1: 'MA', 2: 'MA', 3: 'MA', 4: 'IN'},
'5': {0: 'Full', 1: 'Full', 2: 'Full', 3: 'Full', 4: 'Full'},
'6': {0: 'Flatbed',
1: 'Flatbed_/_Step_Deck',
2: 'Flatbed_w/Tarps',
3: 'Flatbed_w/Tarps',
4: 'Flatbed'},
'7': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 945.0},
'8': {0: 46, 1: 48, 2: 47, 3: 48, 4: 46},
'9': {0: 48, 1: 48, 2: 1, 3: 1, 4: 48},
'10': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'11': {0: 'Dispatch', 1: 'Ken', 2: 'Dispatch', 3: 'Dispatch', 4: 'Dispatch'},
'12': {0: '(800)_488-1860',
1: '(903)_280-7878',
2: '(912)_748-3801',
3: '(541)_826-4786',
4: '(800)_488-1860'},
'13': {0: 'Meadow_Lark_Agency_',
1: 'UrTruckBroker_',
2: 'DSV_Road_Inc._',
3: 'Sureway_Transportation_Co_/_Anderson_Trucking_Serv_',
4: 'Meadow_Lark_Agency_'}})
df_new.append(df_old, ignore_index=True)
#OR
pd.concat([df_new, df_old])
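One more thing worth checking, purely as an assumption from the code shown: pd.read_csv gives old_df string column labels ('0' through '13'), while pd.DataFrame(data=data) gives the freshly scraped df integer labels (0 through 13). concat/append align on labels, so '0' and 0 are treated as different columns, which would produce exactly this side-by-side doubling. Normalizing the labels before combining should avoid it:
import pandas as pd

# make both frames use the same (string) column labels before stacking them,
# so concat lines '0' up with 0 instead of creating extra empty columns
old_df.columns = old_df.columns.astype(str)
df.columns = df.columns.astype(str)

combined = pd.concat([old_df, df], ignore_index=True).drop_duplicates()
combined.to_csv(this_csv, index=False)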

Split list in Pandas dataframe column into multiple columns

I am working with movie data and have a dataframe column for movie genre. Currently the column contains a list of movie genres for each movie (as most movies are assigned to multiple genres), but for the purpose of this analysis, I would like to parse the list and create a new dataframe column for each genre. So instead of having genre=['Drama','Thriller'] for a given movie, I would have two columns, something like genre1='Drama' and genre2='Thriller'.
Here is a snippet of my data:
{'color': {0: [u'Color::(Technicolor)'],
1: [u'Color::(Technicolor)'],
2: [u'Color::(Technicolor)'],
3: [u'Color::(Technicolor)'],
4: [u'Black and White']},
'country': {0: [u'USA'],
1: [u'USA'],
2: [u'USA'],
3: [u'USA', u'UK'],
4: [u'USA']},
'genre': {0: [u'Crime', u'Drama'],
1: [u'Crime', u'Drama'],
2: [u'Crime', u'Drama'],
3: [u'Action', u'Crime', u'Drama', u'Thriller'],
4: [u'Crime', u'Drama']},
'language': {0: [u'English'],
1: [u'English', u'Italian', u'Latin'],
2: [u'English', u'Italian', u'Spanish', u'Latin', u'Sicilian'],
3: [u'English', u'Mandarin'],
4: [u'English']},
'rating': {0: 9.3, 1: 9.2, 2: 9.0, 3: 9.0, 4: 8.9},
'runtime': {0: [u'142'],
1: [u'175'],
2: [u'202', u'220::(The Godfather Trilogy 1901-1980 VHS Special Edition)'],
3: [u'152'],
4: [u'96']},
'title': {0: u'The Shawshank Redemption',
1: u'The Godfather',
2: u'The Godfather: Part II',
3: u'The Dark Knight',
4: u'12 Angry Men'},
'votes': {0: 1793199, 1: 1224249, 2: 842044, 3: 1774083, 4: 484061},
'year': {0: 1994, 1: 1972, 2: 1974, 3: 2008, 4: 1957}}
Any help would be greatly appreciated! Thanks!
I think you need the DataFrame constructor with add_prefix, and then concat back to the original:
df1 = pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')
df = pd.concat([df.drop('genre',axis=1), df1], axis=1)
Timings:
df = pd.DataFrame(d)
print (df)
#5000 rows
df = pd.concat([df]*1000).reset_index(drop=True)
In [394]: %timeit (pd.concat([df.drop('genre',axis=1), pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')], axis=1))
100 loops, best of 3: 3.4 ms per loop
In [395]: %timeit (pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1))
1 loop, best of 3: 757 ms per loop
This should work for you:
pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1)
