Problem with joining dataframes using pd.merge - python

I have the following code:
import pandas as pd
houses = houses.reset_index()
houses['difference'] = houses[start]-houses[bottom]
towns = pd.merge(houses, unitowns, how='inner', on=['State','RegionName'])
print(towns)
However, the output is a dataframe 'towns' with 0 rows.
I can't understand why, given that the dataframe 'houses' looks like this:
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
And the dataframe 'unitowns' looks like this:
State RegionName 2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 ... 2014q2 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3
0 New York New York NaN NaN NaN NaN NaN NaN NaN NaN ... 5.154667e+05 5.228000e+05 5.280667e+05 5.322667e+05 5.408000e+05 5.572000e+05 5.728333e+05 5.828667e+05 5.916333e+05 587200.0
1 California Los Angeles 2.070667e+05 2.144667e+05 2.209667e+05 2.261667e+05 2.330000e+05 2.391000e+05 2.450667e+05 2.530333e+05 ... 4.980333e+05 5.090667e+05 5.188667e+05 5.288000e+05 5.381667e+05 5.472667e+05 5.577333e+05 5.660333e+05 5.774667e+05 584050.0
2 Illinois Chicago 1.384000e+05 1.436333e+05 1.478667e+05 1.521333e+05 1.569333e+05 1.618000e+05 1.664000e+05 1.704333e+05 ... 1.926333e+05 1.957667e+05 2.012667e+05 2.010667e+05 2.060333e+05 2.083000e+05 2.079000e+05 2.060667e+05 2.082000e+05 212000.0
3 Pennsylvania Philadelphia 5.300000e+04 5.363333e+04 5.413333e+04 5.470000e+04 5.533333e+04 5.553333e+04 5.626667e+04 5.753333e+04 ... 1.137333e+05 1.153000e+05 1.156667e+05 1.162000e+05 1.179667e+05 1.212333e+05 1.222000e+05 1.234333e+05 1.269333e+05 128700.0
4 Arizona Phoenix 1.118333e+05 1.143667e+05 1.160000e+05 1.174000e+05 1.196000e+05 1.215667e+05 1.227000e+05 1.243000e+05 ... 1.642667e+05 1.653667e+05 1.685000e+05 1.715333e+05 1.741667e+05 1.790667e+05 1.838333e+05 1.879000e+05 1.914333e+05 195200.0
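A hedged diagnostic sketch (an assumption, not confirmed by the question itself): an inner merge returns 0 rows when the key values never match exactly, e.g. because of stray whitespace or mismatched dtypes in 'State'/'RegionName'. Hypothetical stand-in frames:
import pandas as pd

houses = pd.DataFrame({'State': ['Alabama '], 'RegionName': ['Auburn']})   # note trailing space
unitowns = pd.DataFrame({'State': ['Alabama'], 'RegionName': ['Auburn']})

# inner merge finds no exact key matches -> empty result
print(pd.merge(houses, unitowns, how='inner', on=['State', 'RegionName']))

# normalising the keys on both sides makes them comparable again
for frame in (houses, unitowns):
    frame['State'] = frame['State'].str.strip()
    frame['RegionName'] = frame['RegionName'].str.strip()

# now the merge returns the matching row
print(pd.merge(houses, unitowns, how='inner', on=['State', 'RegionName']))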

Related

How to merge two datasets with different time ranges?

I have two datasets that look like this:
df1:
Date     City     State  Quantity
2019-01  Chicago  IL     35
2019-01  Orlando  FL     322
...      ...      ...    ...
2021-07  Chicago  IL     334
2021-07  Orlando  FL     4332
df2:
Date     City     State  Sales
2020-03  Chicago  IL     30
2020-03  Orlando  FL     319
...      ...      ...    ...
2021-07  Chicago  IL     331
2021-07  Orlando  FL     4000
My date is in format period[M] for both datasets. I have tried df1.join(df2, how='outer') and df2.join(df1, how='outer'), but the rows don't line up correctly; essentially, in 2019-01 I end up with the sales for 2020-03. I have not been able to use merge() because I would have to merge on a combination of City, State and Date. How can I join these two datasets so that my output is as follows:
Date     City     State  Quantity  Sales
2019-01  Chicago  IL     35        NaN
2019-01  Orlando  FL     322       NaN
...      ...      ...    ...       ...
2021-07  Chicago  IL     334       331
2021-07  Orlando  FL     4332      4000
You can outer-merge. By not specifying the columns to merge on, you merge on the intersection of the columns in both DataFrames (in this case, Date, City and State).
out = df1.merge(df2, how='outer').sort_values(by='Date')
Output:
Date City State Quantity Sales
0 2019-01 Chicago IL 35.0 NaN
1 2019-01 Orlando FL 322.0 NaN
4 2020-03 Chicago IL NaN 30.0
5 2020-03 Orlando FL NaN 319.0
2 2021-07 Chicago IL 334.0 331.0
3 2021-07 Orlando FL 4332.0 4000.0
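For reference, a self-contained sketch reproducing the merge above with the sample rows from the question (only the first and last months; dates kept as plain strings here, though a period[M] Date column merges the same way):
import pandas as pd

df1 = pd.DataFrame({'Date': ['2019-01', '2019-01', '2021-07', '2021-07'],
                    'City': ['Chicago', 'Orlando', 'Chicago', 'Orlando'],
                    'State': ['IL', 'FL', 'IL', 'FL'],
                    'Quantity': [35, 322, 334, 4332]})
df2 = pd.DataFrame({'Date': ['2020-03', '2020-03', '2021-07', '2021-07'],
                    'City': ['Chicago', 'Orlando', 'Chicago', 'Orlando'],
                    'State': ['IL', 'FL', 'IL', 'FL'],
                    'Sales': [30, 319, 331, 4000]})

# no `on=` -> merge on the shared columns: Date, City, State
out = df1.merge(df2, how='outer').sort_values(by='Date')
print(out)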

How to delete rows where a column has non-NaN values

Input Dataframe (df)
Country Region Date Value.....
ABW NaN 01-01-2020 123
ABW NaN 02-01-2020 1234
ABW NaN 03-01-2020 3242
USA NaN 04-01-2020 4354
USA NaN 05-01-2020 43543
USA NaN 06-01-2020 34534
USA NaN 07-01-2020 435
USA WA 08-01-2020 43345
USA WA 09-01-2020 345
USA WV 10-01-2020 345
.
.
.
.
Expected Output (df1)
Country Region Date Value.....
ABW NaN 01-01-2020 123
ABW NaN 02-01-2020 1234
ABW NaN 03-01-2020 3242
USA NaN 04-01-2020 4354
USA NaN 05-01-2020 43543
USA NaN 06-01-2020 34534
USA NaN 07-01-2020 435
.
.
.
.
So from the above dataframe you can see that the 'Region' column has NaN as well as non-NaN values; I'd like to remove every row where 'Region' has a non-NaN value.
Also, AFTER performing the above operation, if I wanted to entirely remove the Region column, how would I do that in the fastest possible way (10k+ columns)?
FINAL Expected Output
Country Date Value.....
ABW 01-01-2020 123
ABW 02-01-2020 1234
ABW 03-01-2020 3242
USA 04-01-2020 4354
USA 05-01-2020 43543
USA 06-01-2020 34534
USA 07-01-2020 435
Here's the code I tried:
df1 = df.isnull(df['Region'])
Error:
TypeError: isnull() takes 1 positional argument but 2 were given
Using @BEN_YO's suggestion, this is what I did; it works fine:
filtered_df = df1[df1['Region'].isnull()]
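For the second part of the question (removing the Region column afterwards), DataFrame.drop does it in one call, chained onto the filter:
filtered_df = df1[df1['Region'].isnull()].drop(columns='Region')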

Set "Year" column to individual columns to create a panel

I am trying to reshape the following dataframe such that it is in panel data form by moving the "Year" column such that each year is an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want each year to be an individual column; this is an example:
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack, but I don't think I want a multilevel index as a result. I have been looking through the documentation (to_frame etc.) but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True, then select the column '0' and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
pivot_table can also help:
df2 = pd.pivot_table(df,values='0', columns='AwardYear', index=['State'])
df2
Result:
AwardYear 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
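Both routes can be checked on a tiny stand-in frame (an illustration only; the value column is assumed to be literally named '0', matching the question's display):
import pandas as pd

df = pd.DataFrame({'State': ['Alabama', 'Alabama', 'Wyoming', 'Wyoming'],
                   'Award Year': [2003, 2004, 2011, 2012],
                   '0': [89, 92, 4, 2]}).set_index('State')

# unstack route: move 'Award Year' into the index, then pivot it out to columns
print(df.set_index('Award Year', append=True)['0'].unstack())

# pivot_table route on the flat frame
print(pd.pivot_table(df.reset_index(), values='0', index='State', columns='Award Year'))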

Creating a new column by merging on column name and other column value

Trying to create a new column in DF1 that lists the home team's number of all-stars for that year.
DF1
Date Visitor V_PTS Home H_PTS \
0 2012-10-30 19:00:00 Washington Wizards 84 Cleveland Cavaliers 94
1 2012-10-30 19:30:00 Dallas Mavericks 99 Los Angeles Lakers 91
2 2012-10-30 20:00:00 Boston Celtics 107 Miami Heat 120
3 2012-10-31 19:00:00 Dallas Mavericks 94 Utah Jazz 113
4 2012-10-31 19:00:00 San Antonio Spurs 99 New Orleans Pelicans 95
Attendance Arena Location Capacity \
0 20562 Quicken Loans Arena Cleveland, Ohio 20562
1 18997 Staples Center Los Angeles, California 18997
2 20296 American Airlines Arena Miami, Florida 19600
3 17634 Vivint Smart Home Arena Salt Lake City, Utah 18303
4 15358 Smoothie King Center New Orleans, Louisiana 16867
Yr Arena Opened Season
0 1994 2012-13
1 1992 2012-13
2 1999 2012-13
3 1991 2012-13
4 1999 2012-13
DF2
2012-13 2013-14 2014-15 2015-16 2016-17
Cleveland Cavaliers 1 1 2 1 3
Los Angeles Lakers 2 1 1 1 0
Miami Heat 3 3 2 2 1
Chicago Bulls 2 1 2 2 1
Detroit Pistons 0 0 0 1 1
Los Angeles Clippers 2 2 2 1 1
New Orleans Pelicans 0 1 1 1 1
Philadelphia 76ers 1 0 0 0 0
Phoenix Suns 0 0 0 0 0
Portland Trail Blazers 1 2 2 0 0
Toronto Raptors 0 1 1 2 2
DF1['H_Allstars'] = DF2[DF1['Season'], DF1['Home']]
results in TypeError: 'Series' objects are mutable, thus they cannot be hashed
I understand the error; I'm just not sure how else to do it.
I've removed the extra columns and just focused on the necessary ones for demonstration.
Input:
df1
Home 2012-13 2013-14 2014-15 2015-16 2016-17
0 Cleveland Cavaliers 1 1 2 1 3
1 Los Angeles Lakers 2 1 1 1 0
2 Miami Heat 3 3 2 2 1
3 Chicago Bulls 2 1 2 2 1
4 Detroit Pistons 0 0 0 1 1
5 Los Angeles Clippers 2 2 2 1 1
6 New Orleans Pelicans 0 1 1 1 1
7 Philadelphia 76ers 1 0 0 0 0
8 Phoenix Suns 0 0 0 0 0
9 Portland Trail Blazers 1 2 2 0 0
10 Toronto Raptors 0 1 1 2 2
df2
Visitor Home Season
0 Washington Wizards Cleveland Cavaliers 2012-13
1 Dallas Mavericks Los Angeles Lakers 2012-13
2 Boston Celtics Miami Heat 2012-13
3 Dallas Mavericks Utah Jazz 2012-13
4 San Antonio Spurs New Orleans Pelicans 2012-13
Step 1: Melt df1 to get the allstars column
df3 = pd.melt(df1, id_vars='Home', value_vars=df1.columns[df1.columns.str.contains('20')], var_name='Season', value_name='H_Allstars')
Output:
Home Season H_Allstars
0 Cleveland Cavaliers 2012-13 1
1 Los Angeles Lakers 2012-13 2
2 Miami Heat 2012-13 3
3 Chicago Bulls 2012-13 2
4 Detroit Pistons 2012-13 0
5 Los Angeles Clippers 2012-13 2
6 New Orleans Pelicans 2012-13 0
7 Philadelphia 76ers 2012-13 1
8 Phoenix Suns 2012-13 0
...
Step 2: Merge this new dataframe with df2 to get the H_Allstars column
df4 = pd.merge(df2, df3, how='left', on=['Home', 'Season'])
Output:
Visitor Home Season H_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0
2 Boston Celtics Miami Heat 2012-13 3.0
3 Dallas Mavericks Utah Jazz 2012-13 NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0
Step 3: Add the V_Allstars column
# renaming column as required
df3.rename(columns={'Home': 'Visitor', 'H_Allstars': 'V_Allstars'}, inplace=True)
df5 = pd.merge(df4, df3, how='left', on=['Visitor', 'Season'])
Output:
Visitor Home Season H_Allstars V_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0 NaN
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0 NaN
2 Boston Celtics Miami Heat 2012-13 3.0 NaN
3 Dallas Mavericks Utah Jazz 2012-13 NaN NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0 NaN
You can use pandas.melt. Bring df2 to long format, i.e. Home and Season as columns and the all-star counts as values, then merge to df1 on 'Home' and 'Season'.
import pandas as pd
df2['Home'] = df2.index
df2 = pd.melt(df2, id_vars = 'Home', value_vars = ['2012-13', '2013-14', '2014-15', '2015-16', '2016-17'], var_name = 'Season', value_name='H_Allstars')
df = df1.merge(df2, on=['Home','Season'], how='left')
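To add the visitors' counts as well (mirroring Step 3 of the previous answer), the same melted frame can be merged a second time under renamed columns:
df2v = df2.rename(columns={'Home': 'Visitor', 'H_Allstars': 'V_Allstars'})
df = df.merge(df2v, on=['Visitor', 'Season'], how='left')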

Given conditions for list.append [duplicate]

I need to create a Pandas DataFrame from a text file with the following structure:
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]
The rows with "[edit]" are States and the rows with [number] are Regions. I need to split these and repeat the State name for each Region Name thereafter.
Index State Region Name
0 Alabama Aurburn...
1 Alabama Florence...
2 Alabama Jacksonville...
...
9 Alaska Fairbanks...
10 Arizona Flagstaff...
11 Arizona Tempe...
I'm not sure how to split the text file based on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the State name for each Region Name. Can anyone give me a starting point?
You can first use read_csv with the names parameter to create a DataFrame with the column Region Name; the separator should be a value that does NOT appear in the data (like ;):
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
Then insert a new column State by extracting the rows containing [edit] (forward-filled), and strip everything from ( to the end of Region Name:
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '')
Last, remove the rows containing [edit] by boolean indexing; the mask is created by str.contains:
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
If you need the full values, the solution is simpler:
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
You could parse the file into tuples first:
import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'state area')
items = []
with open('unis.txt') as f:
    for line in f:
        l = line.rstrip('\n')
        if l.endswith('[edit]'):
            # slice the literal suffix off; str.rstrip('[edit]') would strip
            # any trailing characters from that set and mangle e.g. 'Connecticut'
            state = l[:-len('[edit]')]
        else:
            # keep everything before the ' (' that starts the university list
            i = l.index(' (')
            area = l[:i]
            items.append(Item(state, area))

df = pd.DataFrame.from_records(items, columns=['State', 'Area'])
print(df)
output:
State Area
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
Assuming you have the following DF:
In [73]: df
Out[73]:
text
0 Alabama[edit]
1 Auburn (Auburn University)[1]
2 Florence (University of North Alabama)
3 Jacksonville (Jacksonville State University)[2]
4 Livingston (University of West Alabama)[2]
5 Montevallo (University of Montevallo)[2]
6 Troy (Troy University)[2]
7 Tuscaloosa (University of Alabama, Stillman Co...
8 Tuskegee (Tuskegee University)[5]
9 Alaska[edit]
10 Fairbanks (University of Alaska Fairbanks)[2]
11 Arizona[edit]
12 Flagstaff (Northern Arizona University)[6]
13 Tempe (Arizona State University)
14 Tucson (University of Arizona)
15 Arkansas[edit]
you can use the Series.str.extract() method:
In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)
In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)
In [120]: df.State = df.State.ffill()
In [121]: df
Out[121]:
text State Region Name
0 Alabama[edit] Alabama NaN
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
9 Alaska[edit] Alaska NaN
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
11 Arizona[edit] Arizona NaN
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
15 Arkansas[edit] Arkansas NaN
In [122]: df = df.dropna()
In [123]: df
Out[123]:
text State Region Name
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
TL;DR
s.groupby(s.str.extract(r'(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]
regex = r'(?P<State>.*?)\[edit\]'  # pattern to match
print(s.groupby(
    # extract yields nulls where we don't have "[edit]";
    # forward fill fills in the most recent line
    # where we did have an "[edit]"
    s.str.extract(regex, expand=False).ffill()
).apply(
    # I still have all the original values.
    # If I group by the forward-filled rows,
    # I'll want to drop the first one within each group
    pd.Series.tail, n=-1
).reset_index(
    # munge the dataframe to get columns sorted
    name='Region_Name'
)[['State', 'Region_Name']])
State Region_Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
setup
txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]"""
from io import StringIO  # txt is the string defined above
s = pd.read_csv(StringIO(txt), sep='|', header=None, squeeze=True)  # squeeze=True returns a Series (pandas < 2.0)
You will probably need to perform some additional manipulation on the file before getting it into a dataframe.
A starting point would be to split the file into lines, search each line for the string [edit], and use the state name as a dictionary key when it is there...
I do not think that Pandas has any built-in methods that would handle a file in this format.
You seem to be from Coursera's Introduction to Data Science course. This solution passed my test. I would advise not copying the whole solution but using it just for reference purposes :)
import numpy as np
import pandas as pd

lines = open('university_towns.txt').readlines()
lofl = []
flag = False
for line in lines:
    l = []
    if '[edit]' in line:
        index = line[:-7]      # strip the trailing '[edit]\n' to keep the state name
    elif '(' in line:
        pos = line.find('(')
        line = line[:pos - 1]  # keep the text before ' ('
        l.append(index)
        l.append(line)
        flag = True
    else:
        line = line[:-1]       # strip the trailing newline
        l.append(index)
        l.append(line)
        flag = True
    if flag and np.array(l).size != 0:
        lofl.append(l)
df = pd.DataFrame(lofl, columns=["State", "RegionName"])
