Pandas merge result missing rows when joining on strings - python

I have a data set that I've been cleaning, and to clean it I needed to put it into a pivot table to summarize some of the data. I'm now putting it back into a dataframe so that I can merge it with some other dataframes. df1 looks something like this:
Count Region Period ACV PRJ
167 REMAINING US WEST 3/3/2018 5 57
168 REMAINING US WEST 3/31/2018 10 83
169 SAN FRANCISCO 1/13/2018 99 76
170 SAN FRANCISCO 1/20/2018 34 21
df2 looks something like this:
Count MKTcode Region
11 RSMR0 REMAINING US SOUTH
12 RWMR0 REMAINING US WEST
13 SFR00 SAN FRANCISCO
I've tried merging them with this code:
df3 = pd.merge(df1, df2, on='Region', how='inner')
but for some reason pandas is not interpreting the Region columns as the same data; the merge is turning up NaN in the MKTcode column and it seems to be appending df2 to df1, like this:
Count Region Period ACV PRJ MKTcode
193 WASHINGTON, D.C. 3/3/2018 36 38 NaN
194 WASHINGTON, D.C. 3/31/2018 12 3 NaN
195 ATLANTA NaN NaN NaN ATMR0
196 BOSTON NaN NaN NaN B2MRN
I've tried inner and outer joins, but the real problem seems to be that pandas is interpreting the Region column of each dataframe as containing different values.
The MKTcode and Region columns in df2 have only 12 observations and each observation occurs only once, whereas df1 has several repeating instances in the Region column (multiples of the same city). Is there a way I can just create a list of the 12 MKTcodes I need and perform a merge that matches each one with the regions I designate? Like a one-to-many match?
Thanks.

When a merge isn't working as expected, the first thing to do is look at the offending columns.
The biggest culprit in most cases is trailing/leading whitespaces. These are usually introduced when DataFrames are incorrectly read from files.
Try getting rid of extra whitespace characters by stripping them out. Assuming you need to join on the "Region" column, use:
for df in (df1, df2):
    # Strip the column(s) you're planning to join with
    df['Region'] = df['Region'].str.strip()
Now, merging should work as expected:
pd.merge(df1, df2, on='Region', how='inner')
Count_x Region Period ACV PRJ Count_y MKTcode
0 167 REMAINING US WEST 3/3/2018 5 57 12 RWMR0
1 168 REMAINING US WEST 3/31/2018 10 83 12 RWMR0
2 169 SAN FRANCISCO 1/13/2018 99 76 13 SFR00
3 170 SAN FRANCISCO 1/20/2018 34 21 13 SFR00
Another possibility, if you're still getting NaNs, is a difference in whitespace characters between words. For example, 'REMAINING US WEST' will not compare as equal with 'REMAINING  US WEST' (note the double space), even though the two look identical at a glance.
This time, the fix is to use str.replace:
for df in (df1, df2):
    # Collapse any run of whitespace into a single space
    df['Region'] = df['Region'].str.replace(r'\s+', ' ', regex=True)
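If you're not sure which of these is the culprit, a quick diagnostic sketch along these lines can help (it only uses the two frames from the question):
# Which keys in df1 fail to find a match in df2?
print(set(df1['Region']) - set(df2['Region']))
# String lengths can reveal invisible characters such as stray or non-breaking spaces
print(df1['Region'].str.len().head())
print(df2['Region'].str.len().head())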

Related

How to iterate over columns and check condition by group

I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan        48
2      2002  AFG      nan        49
3      2003  AFG      nan        50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work by filtering out any country for which all values of either column (inflation, GDP) are NaN:
(
    df.groupby(['country'])
      .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
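As a quick sanity check on the data from the question (column names assumed as in the table above), only USA should survive the filter:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year':      [2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003],
    'country':   ['AFG'] * 3 + ['CHI'] * 3 + ['USA'] * 3,
    'inflation': [np.nan, np.nan, np.nan, 3.0, 5.0, 7.0, np.nan, 4.0, 2.5],
    'GDP':       [48, 49, 50, np.nan, np.nan, np.nan, 220, 250, 280],
})

kept = df.groupby('country').filter(
    lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all()
)
print(kept['country'].unique())  # ['USA'] -- AFG and CHI are dropped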
Note: if you have more than two columns, you can use a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want this to consider only a specific range of years instead of all of them, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
You can also try this:
# A sum of 0 means there are no values in that column for a given country
group_by = df.groupby(['country']).agg({'inflation': sum, 'GDP': sum}).reset_index()
# Extract only the countries with information in both columns
indexes = group_by[(group_by['GDP'] != 0) & (group_by['inflation'] != 0)].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), 'country'])
# Keep only the rows whose country is in that list
df = df.drop(df[~df.country.isin(final_countries)].index)
You could reshape the data frame from long to wide, drop the nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls, after it's reshaped:
df.dropna(axis=0, how='any', inplace=True)  # delete rows where any value is null
To convert back to long, you can use pd.melt.
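Putting that together, a rough sketch of the pivot, dropna, and reshape-back workflow (using the column names from the question) might look like this; note that how='any' on the wide frame is stricter than the question asks for, since it would also drop USA, which is missing a single year:
# Wide format: one row per country, one column per (variable, year) pair
wide = df.pivot(index='country', columns='year', values=['inflation', 'GDP'])

# Drop countries that are missing any value across the selected years/variables
wide = wide.dropna(axis=0, how='any')

# Back to long format (stack is one option; pd.melt works too)
long_df = wide.stack().reset_index()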

Transform a DataFrame back to integers after np.nan?

I have a DataFrame that looks like this (original is a lot longer):
Country Energy Supply Energy Supply per Capita % Renewable
0 Afghanistan 321 10 78.6693
1 Albania 102 35 100
2 Algeria 1959 51 0.55101
3 American Samoa ... ... 0.641026
4 Andorra 9 121 88.6957
5 Angola 642 27 70.9091
I am trying to replace those pesky '...' with a NaN value using np.nan. But I want to change only those specific '...' values, because if I apply np.nan to the df then all the integers are changed to floats. I am not sure if I am getting this right, so please correct me if I'm wrong. The reason I don't want all the numbers in the df to be floats is that I will have to multiply them by large numbers, and the results come up in scientific notation. I tried using this:
energy = energy.replace('...', np.nan)
But as I said, all the numbers in the df are turned into floats.
If you want to write the data back to a file as integers, df.astype({'col1': 'int32'}) may help.
In NumPy, you may need to split the integer columns from the float columns and operate on them separately; my_npArray.astype(int) may help you.
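As a rough sketch of that workflow, assuming the column names from the example above: replace the '...' placeholders, then cast the affected columns to pandas' nullable integer dtype ('Int64'), which can hold missing values without forcing the whole column to float:
import numpy as np
import pandas as pd

energy = energy.replace('...', np.nan)

# 'Int64' (capital I) is pandas' nullable integer dtype, so it tolerates NaN;
# the column names here are assumed from the example table
for col in ['Energy Supply', 'Energy Supply per Capita']:
    energy[col] = pd.to_numeric(energy[col], errors='coerce').astype('Int64')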

PANDAS dataframe python: wanting to sort values by group

Link to census data
I have the following link above for a CSV file containing the raw data for which I wish to manipulate.
census_df = df = pd.read_csv('https://raw.githubusercontent.com/Qian-Han/coursera-Applied-Data-Science-with-Python/master/Introduction-to-Data-Science-in-Python/original_data/census.csv')
sortedit = census_df.sort_values(by=['STNAME', 'CENSUS2010POP'], ascending=False)
I am trying to order the data in descending order by the column 'CENSUS2010POP'.
I also want to order the data by 'state' alphabetically, hence why I have included the 'STNAME' column in the code above.
However, I only want to select the 3 highest values for 'CENSUS2010POP' from each state ('STNAME').
Thus, if there are 146 states in total, I should have (146 x 3) rows in my new dataframe (and thus in the 'CENSUS2010POP' column).
I would be so grateful if anybody could give me a helping hand?
IIUC, groupby with .nlargest to create an index filter, chained with sort_values:
df2 = df.iloc[df.groupby('STNAME')['CENSUS2010POP']
              .nlargest(3).index.get_level_values(1)] \
        .sort_values(['STNAME', 'CENSUS2010POP'], ascending=True)
print(df['STNAME'].nunique())
51
print(df2.shape)
(152, 100)
print(df2[['STNAME','CENSUS2010POP']])
STNAME CENSUS2010POP
49 Alabama 412992
37 Alabama 658466
0 Alabama 4779736
76 Alaska 97581
71 Alaska 291826
... ... ...
3137 Wisconsin 947735
3096 Wisconsin 5686986
3182 Wyoming 75450
3180 Wyoming 91738
3169 Wyoming 563626
[152 rows x 2 columns]
try this:
df = census_df.groupby(["STNAME"]).apply(lambda x: x.sort_values(["CENSUS2010POP"], ascending = False)).reset_index(drop=True)
df.groupby('STNAME').head(3)[['STNAME','CENSUS2010POP']]
The first statement returns the dataframe sorted by CENSUS2010POP within each STNAME.
The second statement returns the top 3 rows per state.
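A slightly more compact variant of the same idea, sorting once and then keeping the first three rows per state, might look like this:
top3 = (census_df
        .sort_values(['STNAME', 'CENSUS2010POP'], ascending=[True, False])
        .groupby('STNAME')
        .head(3))
head(3) relies on the rows already being sorted within each state, so the sort has to come first.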

How do I remove duplicate rows in a Pandas DataFrame based on values in a specific column?

I have two dataframes that have duplicates, but I need to remove only the rows that have duplicate VIN numbers, without looking at the other cells.
0 230 5UXCR6C50KTQ4xxxx KLL34607 2019 BMW M3
1 116 5UXCR4C00LLW6xxxx LLW63494 2020 BMW X5 Not Found
2 109 5UXCR6C06LLL7xxxx LLL76916 2020 BMW X5 Need Detail
38 229 5UXCR6C50KLL3xxxx MLL23650 2019 BMW X5
43 115 5UXCR4C00LLW6xxxx LLW63494 2020 BMW X5
37 108 5UXCR6C06LLL7xxxx LLL76916 2020 BMW X5
The last 2 rows look like different rows to pandas, but I need to merge the two data frames and remove rows based only on those VIN numbers, ignoring the 'Not Found' and 'Need Detail' values.
I've tried .drop_duplicates .cumsum() and a few other methods but nothing seems to work.
I think what you're trying to say is that you need to concatenate the two dataframes and then remove all duplicated rows based on only a subset of columns.
You can use pd.concat([df1, df2]).drop_duplicates(subset=['VIN'])
where subset is a list of column names that are used to drop the duplicated rows.
(See the documentation for extra details)
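A minimal sketch (the column name 'VIN' is assumed here, since the frames in the question are shown without headers):
import pandas as pd

# Stack the two frames, then keep only the first occurrence of each VIN
combined = (pd.concat([df1, df2], ignore_index=True)
              .drop_duplicates(subset=['VIN'], keep='first'))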

Pandas groupby stored in a new dataframe

I have the following code:
import pandas as pd
df1 = pd.DataFrame({'Counterparty': ['Bank', 'Bank', 'GSE', 'PSE'],
                    'Sub Cat': ['Tier1', 'Small', 'Small', 'Small'],
                    'Location': ['US', 'US', 'UK', 'UK'],
                    'Amount': [50, 55, 65, 55],
                    'Amount1': [1, 2, 3, 4]})
df2 = df1.groupby(['Counterparty', 'Location'])[['Amount']].sum()
df2.dtypes
df1.dtypes
The df2 data frame does not have the columns that I am aggregating across (Counterparty and Location). Any ideas why this is the case? Both Amount and Amount1 are numeric fields. I just want to sum across Amount and aggregate across Amount1.
To get the grouping keys back as columns instead of the index, add the as_index=False parameter or call reset_index:
df2=df1.groupby(['Counterparty','Location'])[['Amount']].sum().reset_index()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
df2=df1.groupby(['Counterparty','Location'], as_index=False)[['Amount']].sum()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
If you aggregate over all columns, nuisance (non-numeric) columns are excluded automatically - the Sub Cat column is omitted:
df2=df1.groupby(['Counterparty','Location']).sum().reset_index()
print (df2)
Counterparty Location Amount Amount1
0 Bank US 105 3
1 GSE UK 65 3
2 PSE UK 55 4
df2=df1.groupby(['Counterparty','Location'], as_index=False).sum()
Remove the double brackets around 'Amount' and make them single brackets. With double brackets you're telling it to select a one-column DataFrame rather than a Series.
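For what it's worth, a quick sketch of what each form returns; either way the grouping keys end up in the index, so reset_index() (or as_index=False, as shown above) is what brings them back as columns:
# Double brackets -> one-column DataFrame, keys in a MultiIndex
df1.groupby(['Counterparty', 'Location'])[['Amount']].sum()

# Single brackets -> Series, keys still in a MultiIndex
df1.groupby(['Counterparty', 'Location'])['Amount'].sum()

# reset_index() moves Counterparty and Location back into regular columns
df1.groupby(['Counterparty', 'Location'])['Amount'].sum().reset_index()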
