I'm writing a simple script to build a two-way table of distances between various cities.
Basically, I have a list of cities (say just 3: Paris, Berlin, London), and I created the combinations between them with itertools (so I have Paris-Berlin, Paris-London, Berlin-London). I parsed the distances from a website and saved them in a dictionary (so I have: {Paris : {Berlin : 878.36, London : 343.67}, Berlin : {London : 932.14}}).
Now I want to create a two-way table, so that I can look up a pair of cities in Excel (I need it in Excel unfortunately, otherwise with Python all of this would be unnecessary!) and get the distance back. The table has to be complete (i.e. not triangular, so that I can look up London-Paris or Paris-London and the value is there for both row/column pairs). Is something like this possible easily? I was thinking I probably need to fill in my dictionary (i.e. create something like {Paris : {Berlin : 878.36, London : 343.67}, Berlin : {Paris : 878.36, London : 932.14}, London : {Paris : 343.67, Berlin : 932.14}}), and then feed it to pandas, but I'm not sure that's the fastest way. Thank you!
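For concreteness, a minimal sketch of the filling-in I have in mind (using the parsed dictionary above):
distances = {'Paris': {'Berlin': 878.36, 'London': 343.67},
             'Berlin': {'London': 932.14}}
# Mirror every pair so the distance can be looked up in both directions
symmetric = {}
for origin, inner in distances.items():
    for dest, km in inner.items():
        symmetric.setdefault(origin, {})[dest] = km
        symmetric.setdefault(dest, {})[origin] = km
# symmetric now has Paris, Berlin and London as keys,
# each mapping to the other two cities and their distances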
I think this does something like what you need:
import pandas as pd
data = {'Paris': {'Berlin': 878.36, 'London': 343.67}, 'Berlin': {'London': 932.14}}
# Create data frame from dict
df = pd.DataFrame(data)
# Rename index
df.index.name = 'From'
# Make index into a column
df = df.reset_index()
# Turn destination columns into rows
df = df.melt(id_vars='From', var_name='To', value_name='Distance')
# Drop missing values (distance to oneself)
df = df.dropna()
# Concatenate with itself but swapping the order of cities
df = pd.concat([df, df.rename(columns={'From' : 'To', 'To': 'From'})], sort=False)
# Reset index
df = df.reset_index(drop=True)
print(df)
Output:
     From      To  Distance
0  Berlin   Paris    878.36
1  London   Paris    343.67
2  London  Berlin    932.14
3   Paris  Berlin    878.36
4   Paris  London    343.67
5  Berlin  London    932.14
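Since you need the lookup in Excel, you can also pivot this long table back into a full two-way matrix and write it out. A quick sketch (the file name is just an example, and writing .xlsx requires openpyxl installed):
# Pivot into a symmetric From x To matrix (the diagonal stays NaN)
matrix = df.pivot(index='From', columns='To', values='Distance')
# Write to Excel so any row/column pair can be looked up there
matrix.to_excel('distances.xlsx')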
Related
I have two dataframes, df1 and df2.
df1 has 4 columns.
>df1
Neighborhood    Street    Begin Street    End Street
8th Ave         6th St    Church St       Mlk blvd
.....
>df2
Intersection    Roadway
Mlk blvd        Hue St.
I want to add a new column Count to df2 such that, for every row in df2, if any string from the Intersection or Roadway column appears anywhere in df1 (once or more), the Count column gets a value of 1. For example, for this sample df2, since Mlk blvd is found in df1 under the End Street column, df2 will look like:
>df2
Intersection    Roadway    Count
Mlk blvd        Hue St.    1
I also want to strip the strings and make the matching case-insensitive. However, I am not sure how I would set up this matching logic using .iloc. How can I solve this?
Flatten the values in df1 and map them to lower case, then convert the values in df2 to lower case and use isin + any to test for the match:
vals = set(map(str.lower, df1.values.ravel()))
df2['count'] = df2.applymap(str.lower).isin(vals).any(axis=1).astype(int)
  Intersection  Roadway  count
0     Mlk blvd  Hue St.      1
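Since you also mentioned stripping the strings, here is a sketch of the same idea with strip folded into the normalisation (this assumes every cell in both frames is a string):
# Strip whitespace and lower-case both frames the same way before matching
normalize = lambda s: s.strip().lower()
vals = set(map(normalize, df1.values.ravel()))
df2['count'] = df2.applymap(normalize).isin(vals).any(axis=1).astype(int)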
I have a list of names, and I want to retrieve the corresponding information for each of them from different dataframes to form a new dataframe.
I converted the list into a one-column dataframe, intending to look up its corresponding values in the different dataframes.
I have tried:
import pandas as pd
data = {'Name': ["David","Mike","Lucy"]}
data_h = {'Name': ["David", "Mike", "Peter", "Lucy"],
          'Hobby': ['Music', 'Sports', 'Cooking', 'Reading'],
          'Member': ['Yes', 'Yes', 'Yes', 'No']}
data_s = {'Name': ["David", "Lancy", "Mike", "Lucy"],
          'Speed': [56, 42, 35, 66],
          'Location': ['East', 'East', 'West', 'West']}
df = pd.DataFrame(data)
df_hobby = pd.DataFrame(data_h)
df_speed = pd.DataFrame(data_s)
df['Hobby'] = df.lookup(df['Name'], df_hobby['Hobby'])
print (df)
But it returns the error message as:
ValueError: Row labels must have same size as column labels
I have also tried:
df = pd.merge(df, df_hobby, on='Name')
It works but it includes unnecessary columns.
What would be a smart and efficient way to do this, especially when there are many dataframes to look up?
Thank you.
Filter only the columns needed for merging and the columns to append, like:
df = (pd.merge(df, df_hobby[['Name','Hobby']], on='Name')
.merge(df_speed[['Name','Location']], on='Name'))
print(df)
    Name    Hobby Location
0  David    Music     East
1   Mike   Sports     West
2   Lucy  Reading     West
If you want to work with a list, use this solution with filtered columns:
dfList = [df,
          df_hobby[['Name', 'Hobby']],
          df_speed[['Name', 'Location']]]
from functools import reduce
df = reduce(lambda df1,df2: pd.merge(df1,df2,on='Name'), dfList)
print (df)
    Name    Hobby Location
0  David    Music     East
1   Mike   Sports     West
2   Lucy  Reading     West
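Note that pd.merge defaults to an inner join, so a name missing from any of the lookup dataframes drops out of the result entirely. If you would rather keep every name from the original list (with NaN where there is no match), a sketch using a left join:
df = reduce(lambda df1, df2: pd.merge(df1, df2, on='Name', how='left'), dfList)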
I have a column like the following in my dataframe.
What I want is to create a new column, country, based on the language column: if someone's language is "eng", the country column should be filled with UK.
Desired output:
NB: This is a sample I created in Excel; I am working with pandas in a Jupyter notebook.
Considering this to be your df:
In [1359]: df = pd.DataFrame({'driver':['Hamilton', 'Sainz', 'Giovanazi'], 'language':['eng', 'spa', 'ita']})
In [1360]: df
Out[1360]:
      driver language
0   Hamilton      eng
1      Sainz      spa
2  Giovanazi      ita
And this to be your language-country mapping:
In [1361]: mapping = {'eng': 'UK', 'spa': 'Spain', 'ita': 'Italy'}
You can use Series.map to solve it:
In [1363]: df['country'] = df.language.map(mapping)
In [1364]: df
Out[1364]:
      driver language country
0   Hamilton      eng      UK
1      Sainz      spa   Spain
2  Giovanazi      ita   Italy
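One caveat: any language code missing from the mapping becomes NaN in country. If you would rather have a placeholder, a quick sketch with fillna ('Unknown' is just an example value):
df['country'] = df.language.map(mapping).fillna('Unknown')  # placeholder for unmapped codes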
I have produced some data which lists parks in proximity to different areas of East London, with use of the FourSquare API. It is shown here in the dataframe df:
Location,Parks,Borough
Aldborough Hatch,Fairlop Waters Country Park,Redbridge
Ardleigh Green,Haynes Park,Havering
Bethnal Green,"Haggerston Park, Weavers Fields",Tower Hamlets
Bromley-by-Bow,"Rounton Park, Grove Hall Park",Tower Hamlets
Cambridge Heath,"Haggerston Park, London Fields",Tower Hamlets
Dalston,"Haggerston Park, London Fields",Hackney
Import data with df = pd.read_clipboard(sep=',')
What I would like to do is group by the borough column and count the distinct parks in that borough so that for example 'Tower Hamlets' = 5 and 'Hackney' = 2. I will create a new dataframe for this purpose which simply lists total number of parks for each borough present in the dataframe.
I know I can do:
df.groupby(['Borough', 'Parks']).size()
But I need to split parks by the delimiter ',' such that they are treated as unique, distinct entities for a borough.
What do you suggest?
Thanks!
The first rule of data science is to clean your data into a useful format.
Reformat the DataFrame to be usable:
df.Parks = df.Parks.str.split(r',\s*') # per user piRSquared
df = df.explode('Parks') # pandas v 0.25
Now the DataFrame is in a proper format that can be analyzed more easily:
df.groupby('Borough').Parks.nunique()
Borough
Hackney          2
Havering         1
Redbridge        1
Tower Hamlets    5
That's three lines of code, but now the DataFrame is in a useful format, upon which more insights can easily be extracted.
Plot
df.groupby(['Borough']).Parks.nunique().plot(kind='bar', title='Unique Parks Counts by Borough')
If you are using pandas 0.25 or greater, consider the answer from Trenton_M.
His answer provides a good suggestion for creating a more useful data set.
IIUC:
df.groupby('Borough').Parks.apply(
    lambda s: len(set(', '.join(s).split(', ')))
)
Borough
Hackney          2
Havering         1
Redbridge        1
Tower Hamlets    5
Name: Parks, dtype: int64
Similar
df.Parks.str.split(', ').groupby(df.Borough).apply(lambda s: len(set().union(*s)))
Borough
Hackney          2
Havering         1
Redbridge        1
Tower Hamlets    5
Name: Parks, dtype: int64
How do I convert this dataframe
                               location  value
0        (Richmond, Virginia, nan, USA)    100
1  (New York City, New York, nan, USA)     200
to this:
            city     state region country  value
0       Richmond  Virginia    nan     USA    100
1  New York City  New York    nan     USA    200
Note that the location column in the first dataframe contains tuples. I want to create four columns out of the location column.
new_col_list = ['city', 'state', 'regions', 'country']
for n, col in enumerate(new_col_list):
    df[col] = df['location'].apply(lambda location: location[n])
df = df.drop('location', axis=1)
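A variant of the same idea that builds all four columns in one step instead of looping (a sketch, assuming every tuple has exactly four elements):
# Expand the tuples into columns in one go, keeping the original index
df[new_col_list] = pd.DataFrame(df['location'].tolist(), index=df.index)
df = df.drop('location', axis=1)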
If you return a Series of the (split) location, you can join (merging on the index) the resulting DataFrame directly with your value column.
addr = ['city', 'state', 'region', 'country']
df[['value']].join(df.location.apply(lambda loc: pd.Series(loc, index=addr)))
   value           city     state region country
0    100       Richmond  Virginia    NaN     USA
1    200  New York City  New York    NaN     USA
I haven't timed this, but I would suggest this option:
df.loc[:, 'city'] = df.location.map(lambda x: x[0])
df.loc[:, 'state'] = df.location.map(lambda x: x[1])
df.loc[:, 'regions'] = df.location.map(lambda x: x[2])
df.loc[:, 'country'] = df.location.map(lambda x: x[3])
I'm guessing that avoiding an explicit for loop might lend itself to vectorized (SIMD) execution; NumPy certainly looks for that, though perhaps other libraries do not.
I prefer to use pd.DataFrame.from_records to convert the tuples to Series. Then this can be joined to the previous dataset as described by meloncholy.
df = pd.DataFrame({"location":[("Richmond", "Virginia", pd.NA, "USA"),
("New York City", "New York", pd.NA, "USA")],
"value": [100,200]})
loc = pd.DataFrame.from_records(df.location, columns=['city','state','regions','country'])
df.drop("location", axis=1).join(loc)
from_records does assume a sequential index. If this is not the case you should pass the index to the new DataFrame:
loc = pd.DataFrame.from_records(df.location.reset_index(drop=True),
                                columns=['city', 'state', 'regions', 'country'],
                                index=df.index)