How do I convert this dataframe
                              location  value
0       (Richmond, Virginia, nan, USA)    100
1  (New York City, New York, nan, USA)    200
to this:
            city     state region country  value
0       Richmond  Virginia    nan     USA    100
1  New York City  New York    nan     USA    200
Note that the location column in the first dataframe contains tuples. I want to create four columns out of the location column.
new_col_list = ['city', 'state', 'region', 'country']
for n, col in enumerate(new_col_list):
    df[col] = df['location'].apply(lambda location: location[n])
df = df.drop('location', axis=1)
If you return a Series of the split location, you can join the resulting DataFrame directly with your value column (join merges on the index).
addr = ['city', 'state', 'region', 'country']
df[['value']].join(df.location.apply(lambda loc: pd.Series(loc, index=addr)))
   value           city     state region country
0    100       Richmond  Virginia    NaN     USA
1    200  New York City  New York    NaN     USA
I haven't timed this, but I would suggest this option:
df.loc[:, 'city'] = df.location.map(lambda x: x[0])
df.loc[:, 'state'] = df.location.map(lambda x: x[1])
df.loc[:, 'region'] = df.location.map(lambda x: x[2])
df.loc[:, 'country'] = df.location.map(lambda x: x[3])
I'm guessing that avoiding an explicit for loop might lend itself to SIMD instructions (NumPy certainly looks for that, though perhaps other libraries do not).
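Going further in that direction, here is a minimal sketch (assuming the df from the question) that expands all four columns in one pass instead of one map per column:

import pandas as pd

# Build one DataFrame from the tuples in a single shot, then join on index
# (assumes every tuple has exactly four elements).
cols = ['city', 'state', 'region', 'country']
expanded = pd.DataFrame(df['location'].tolist(), index=df.index, columns=cols)
df = df.drop('location', axis=1).join(expanded)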
I prefer to use pd.DataFrame.from_records to convert the tuples to Series. Then this can be joined to the previous dataset as described by meloncholy.
df = pd.DataFrame({"location":[("Richmond", "Virginia", pd.NA, "USA"),
("New York City", "New York", pd.NA, "USA")],
"value": [100,200]})
loc = pd.DataFrame.from_records(df.location, columns=['city', 'state', 'region', 'country'])
df.drop("location", axis=1).join(loc)
from_records assumes a sequential index. If that is not the case, you should pass the index to the new DataFrame:
loc = pd.DataFrame.from_records(df.location.reset_index(drop=True),
                                columns=['city', 'state', 'region', 'country'],
                                index=df.index)
I have a dataframe with some recurring values in one column. I want to group by that column and sum the other columns. The dataframe looks like this:
Edit: here is the code to create the dataframe. Notice the column called 'Able', which is the index.
df = pd.DataFrame({'Able': ['Blue', 'Green', 'Red', 'Orange'],
                   'Baker': ['New York', 'New Jersey', 'New York', 'New Jersey'],
                   'Charlie': [3, 4, '', 7],
                   'Delta': ['', 5, 6, ''],
                   'Echo': [100, 200, 300, 400]}).set_index('Able')
The result should group on 'Baker' and sum the other three columns. I've tried various flavors of groupby and pivot_table. They return the correct two rows (New York and New Jersey), but they only return 'Baker' and the sum of the rightmost column, 'Echo'. The far-left column 'Able', which is the index of the source dataframe, should be ignored. My output should look like this (edited thanks to @corralien for spotting an error):
Baker       Charlie  Delta  Echo
New Jersey       11      5   600
New York          3      6   400
How do I return all the columns, ideally without listing them by name in the code?
Replace the empty string with 0 and aggregate with sum. This will depend on the dtypes of the last three columns. I reproduced the df for you; feel free to edit the question if I got the dtypes wrong. The forum will guide you.
Dataframe
df = pd.DataFrame({'Baker': ['New York', 'New Jersey', 'New York', 'New Jersey'],
                   'Charlie': [3, 4, '', 7],
                   'Delta': ['', 5, 6, ''],
                   'Echo': [100, 200, 300, 400]})
Code
df.replace('', 0).groupby('Baker').agg('sum')
Output
            Charlie  Delta  Echo
Baker
New Jersey       11      5   600
New York          3      6   400
Use pivot_table:
>>> df.pivot_table(index='Baker', values=['Charlie', 'Delta', 'Echo'],
aggfunc='sum').reset_index()
        Baker  Charlie  Delta  Echo
0  New Jersey     11.0    5.0   600
1    New York      3.0    6.0   400
Ensure your Charlie, Delta, and Echo columns are numeric; try df.replace('', 0) or df.fillna(0) to fill your blank cells.
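A minimal sketch of that coercion (assuming the df from the first answer, without the 'Able' index): make the three value columns numeric before pivoting:

import pandas as pd

cols = ['Charlie', 'Delta', 'Echo']
# Replace the blank strings, then coerce the object columns to real
# numeric dtypes so pivot_table can sum them.
df[cols] = df[cols].replace('', 0).apply(pd.to_numeric)
print(df.pivot_table(index='Baker', values=cols, aggfunc='sum').reset_index())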
So I have a use case where I have a few tables with different types of events in a time series, plus another table with base information. The events are of different types with different columns: for example, a "marriage" event could have the columns "husband name" and "wife name", and a table of "job" events can have columns like "hired on" and "fired on", but can also have "husband name". The base info table is not time series data and has fields like "case ID" and "city of case".
The goal would be to: 1. have all the different time series tables in one table with all possible columns, where it's okay to have NaN wherever a column has no data; and 2. have every entry in the time series carry all available data from the base data table.
For example:
df = pd.DataFrame(np.array([['Dave', 1, 'call'],
                            ['Josh', 2, 'rejection'],
                            ['Greg', 3, 'call']]),
                  columns=['husband name', 'casenum', 'event'])
df2 = pd.DataFrame(np.array([['Dave', 'Mona', 1, 'new lamp'],
                             ['Max', 'Lisa', 1, 'big increase'],
                             ['Pete', 'Esther', 3, 'call'],
                             ['Josh', 'Moana', 2, 'delivery']]),
                   columns=['husband name', 'wife name', 'casenum', 'event'])
df3 = pd.DataFrame(np.array([[1, 'new york'],
                             [3, 'old york'],
                             [2, 'york']]),
                   columns=['casenum', 'city'])
I'm trying a concat:
concat = pd.concat([df, df2, df3])
This doesn't work, because we already know that for casenum 1 the city is 'new york'.
I'm trying a join:
innerjoin = pd.merge(df, df2, on='casenum', how='inner')
innerjoin = pd.merge(innerjoin, df3, on='casenum', how='inner')
This also isn't right, as I want a record of all the events from both tables. Also, interestingly enough, the result is the same for both inner and outer joins on the dummy data; however, on my actual data an inner join results in more rows than the sum of both event tables, which I don't quite understand.
Basically, my desired outcome would be:
  husband name casenum         event wife name      city
0         Dave       1          call       NaN  new york
1         Josh       2     rejection       NaN      york
2         Greg       3          call       NaN  old york
0         Dave       1      new lamp      Mona  new york
1          Max       1  big increase      Lisa  new york
2         Pete       3          call    Esther  old york
3         Josh       2      delivery     Moana      york
I've tried inner joins, outer joins, concats, none seem to work. Maybe I'm just too tired, but what do I need to do to get this output? Thank you!
I think you can merge twice with the outer option:
(df.merge(df2, on=['husband name', 'casenum', 'event'], how='outer')
   .merge(df3, on='casenum')
)
Output:
  husband name casenum         event wife name      city
0         Dave       1          call       NaN  new york
1         Dave       1      new lamp      Mona  new york
2          Max       1  big increase      Lisa  new york
3         Josh       2     rejection       NaN      york
4         Josh       2      delivery     Moana      york
5         Greg       3          call       NaN  old york
6         Pete       3          call    Esther  old york
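As for why an inner join can return more rows than both event tables combined: merging on a non-unique key emits one output row per matching pair, so duplicated casenum values multiply. A minimal sketch with hypothetical data:

import pandas as pd

# Two rows with casenum 1 on each side -> 2 x 2 = 4 rows after the merge.
left = pd.DataFrame({'casenum': [1, 1], 'a': ['x', 'y']})
right = pd.DataFrame({'casenum': [1, 1], 'b': ['u', 'v']})
print(pd.merge(left, right, on='casenum', how='inner'))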
I have these two data frames:
1st df
#df1 -----
            location Ethnic Origins  Percent(1)
0  Beaches-East York        English        18.9
1          Davenport     Portuguese        22.7
2  Eglinton-Lawrence         Polish        12.0
2nd df
#df2 -----
                                            location        lat        lng
0  Beaches—East York, Old Toronto, Toronto, Golde...  43.681470 -79.306021
1  Davenport, Old Toronto, Toronto, Golden Horses...  43.671561 -79.448293
2  Eglinton—Lawrence, North York, Toronto, Golden...  43.719265 -79.429765
Expected Output:
I want to use the location column of #df1, as it is cleaner, and retain all other columns. I don't need the city and country info in the location column.
            location Ethnic Origins  Percent(1)        lat        lng
0  Beaches-East York        English        18.9  43.681470 -79.306021
1          Davenport     Portuguese        22.7  43.671561 -79.448293
2  Eglinton-Lawrence         Polish        12.0  43.719265 -79.429765
I have tried several ways to merge them but to no avail.
This returns a NaN for all lat and lng rows:
df3 = pd.merge(df1, df2, on="location", how="left")
This returns a NaN for all Ethnic Origins and Percent(1) rows:
df3 = pd.merge(df1, df2, on="location", how="right")
As others have noted, the problem is that the 'location' columns do not share any values. One solution is to use a regular expression to remove everything from the first comma to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide, this still won't work, because you have different kinds of dashes in the two data frames. You can solve this in a similar way (no regex needed this time, but it has to be a substring replacement, hence str.replace rather than replace):
df2.location = df2.location.str.replace('—', '-')
And then merge as you suggested:
df3 = pd.merge(df1, df2, on="location", how="left")
We can use findall to create the key:
# Extract the first df1 location that occurs in each df2 location string
df2['location'] = df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
I'm guessing the problem you're having is that the column you're trying to merge on is not the same in both frames, i.e. the values in df2.location don't match anything in df1.location. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
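Note that, as the earlier answer points out, the em-dashes in df2 would still block two of the three matches with the exact data shown. A combined sketch (assuming the data above):

# Normalize the dash first, then keep only the part before the first comma.
df2["location"] = df2["location"].str.replace("—", "-").str.split(",").str[0]
df3 = pd.merge(df1, df2, on="location", how="left")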
I'm writing a simple code to have a two-way table of distances between various cities.
Basically, I have a list of cities (say just 3: Paris, Berlin, London), and I created the combinations between them with itertools (so I have Paris-Berlin, Paris-London, Berlin-London). I parsed the distances from a website and saved them in a dictionary (so I have {Paris: {Berlin: 878.36, London: 343.67}, Berlin: {London: 932.14}}).
Now I want to create a two-way table, so that I can look up a pair of cities in Excel (I need it in Excel, unfortunately; otherwise with Python all of this would be unnecessary!) and get the distance back. The table has to be complete (i.e. not triangular, so that I can look up London-Paris or Paris-London and the value is there for both row/column pairs). Is something like this easily possible? I was thinking I probably need to fill in my dictionary (i.e. create something like {Paris: {Berlin: 878.36, London: 343.67}, Berlin: {Paris: 878.36, London: 932.14}, London: {Paris: 343.67, Berlin: 932.14}}) and then feed it to Pandas, but I'm not sure that's the fastest way. Thank you!
I think this does something like what you need:
import pandas as pd
data = {'Paris': {'Berlin': 878.36, 'London': 343.67}, 'Berlin': {'London': 932.14}}
# Create data frame from dict
df = pd.DataFrame(data)
# Rename index
df.index.name = 'From'
# Make index into a column
df = df.reset_index()
# Turn destination columns into rows
df = df.melt(id_vars='From', var_name='To', value_name='Distance')
# Drop missing values (distance to oneself)
df = df.dropna()
# Concatenate with itself but swapping the order of cities
df = pd.concat([df, df.rename(columns={'From' : 'To', 'To': 'From'})], sort=False)
# Reset index
df = df.reset_index(drop=True)
print(df)
Output:
     From      To  Distance
0  Berlin   Paris    878.36
1  London   Paris    343.67
2  London  Berlin    932.14
3   Paris  Berlin    878.36
4   Paris  London    343.67
5  Berlin  London    932.14
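Since the end goal is an Excel lookup table, a short follow-up sketch (assuming the df produced above and a hypothetical file name distances.xlsx): pivot the long table back into a square matrix and export it:

# Rows are origins, columns are destinations; the diagonal stays empty.
table = df.pivot(index='From', columns='To', values='Distance')
table.to_excel('distances.xlsx')  # needs an Excel writer such as openpyxl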
I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating the first three columns as one data frame and the last column as another one. My plan is to sort the second data frame and then merge it to the first one without any key column, so that it looks like the output above.
Is it possible to merge in this way or if there are any alternatives?
One way is to use pd.DataFrame.join after filtering out null values.
Data from #ALollz.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
     Country comments
0        USA        X
1         UK        Y
2    Finland        Z
3      Spain      NaN
4  Australia      NaN
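An equivalent and arguably more idiomatic sketch (assuming the same df1 and df2): drop the nulls and realign the index before joining; note the column keeps its capitalized name here.

# dropna removes the None rows; reset_index(drop=True) renumbers them 0..n-1
# so the join with df1's default index lines up positionally.
res = df1.join(df2.dropna().reset_index(drop=True))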
If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order in tact, then this will get the job done.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
                    'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
df1['Comments'] = df2[df2.Comments.notnull()].reset_index(drop=True)
     Country    Name Comments
0        USA     Sam        X
1         UK   Chris        Y
2    Finland    Jeff        Z
3      Spain  Kartik      NaN
4  Australia  Mavenn      NaN
IIUC:
# sort_values puts the NaNs last by default; .values drops the index so the
# sorted comments are assigned positionally.
input['Comments'] = input.Comments.sort_values().values
Output:
  Comments    Country    Name
1        X        USA     Sam
2        Y         UK   Chris
3        Z    Finland    Jeff
4      NaN      Spain  Kartik
5      NaN  Australia   Maven