Pandas merge method returning empty dataframe - python

I have two dataframes with the following info:
>>> ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
id 5 non-null int64
movie_id 5 non-null object
rating 5 non-null object
account_id 5 non-null int64
dtypes: int64(2), object(2)
memory usage: 240.0+ bytes
>>> movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 296 entries, 0 to 295
Data columns (total 9 columns):
id 296 non-null int64
description 296 non-null object
genre 296 non-null object
imdb_url 296 non-null object
img_url 296 non-null object
title 296 non-null object
users_rating 296 non-null object
year 296 non-null object
movie_id 296 non-null object
dtypes: int64(1), object(8)
memory usage: 20.9+ KB
Despite the common columns having the same data types, the merge returns an empty result:
>>> pd.merge(ratings,movies)
Empty DataFrame
Columns: [id, movie_id, rating, account_id, description, genre,
imdb_url, img_url, title, users_rating, year]
Index: []
Previous answers on Stack Overflow suggest checking that the data types match. Since my data types are the same, what is causing this empty result?

By default this performs an inner join on all shared columns, here ['id', 'movie_id'], so if the resulting DataFrame is empty, then no combination of id and movie_id in one dataframe matches a combination in the other. Compare the distinct ('id', 'movie_id') combinations in both dataframes:
movies.groupby(['id', 'movie_id'])['id'].count()
ratings.groupby(['id', 'movie_id'])['id'].count()
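If the join key is really just movie_id, passing on explicitly avoids the silent join on id as well. A minimal sketch with made-up miniature frames (the column values are assumptions, not the asker's data):

```python
import pandas as pd

# Hypothetical miniatures of the two frames from the question.
ratings = pd.DataFrame({'id': [1, 2], 'movie_id': ['m1', 'm2'],
                        'rating': ['4', '5'], 'account_id': [10, 11]})
movies = pd.DataFrame({'id': [100, 101], 'movie_id': ['m1', 'm2'],
                       'title': ['A', 'B']})

# With no `on`, merge joins on every shared column (here both 'id' and
# 'movie_id'), so rows match only when *both* values agree -> empty result.
print(pd.merge(ratings, movies).shape[0])  # 0

# Joining on the intended key alone succeeds; the unrelated 'id' columns
# are kept apart via suffixes.
merged = pd.merge(ratings, movies, on='movie_id', suffixes=('_rating', '_movie'))
print(len(merged))  # 2
```

Passing `indicator=True` to an outer merge is another quick way to see which rows matched ('both') versus came from only one side.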

Related

How to merge 2 dataframes on columns in pandas

I'm having trouble merging two dataframes in pandas. They are parts of a dataset split between two files, and they share some columns and values, namely 'name' and 'address'. The entries with identical values do not share their index with entries in the other file. I tried variations of the following line:
res = pd.merge(df, df_p, on=['name', 'address'], how="left")
When the how argument was set to 'left', the columns from df_p had no values. 'right' had the opposite effect, with columns from df being empty. 'inner' resulted in an empty dataframe and 'outer' duplicated the number of entries, essentially just appending the results of 'left' and 'right'.
I manually verified that there are identical combinations of 'name' and 'address' values in both files.
Edit: Merging on just one of those columns appears to succeed; however, I want to avoid merging incorrect entries in case two people with identical names have different addresses, or vice versa.
Edit1: Here's some more information on the data-set.
df.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3983 entries, 0 to 3982
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3983 non-null int64
1 name 3983 non-null object
2 address 3983 non-null object
3 race 3970 non-null object
4 marital-status 3967 non-null object
5 occupation 3971 non-null object
6 pregnant 3969 non-null object
7 education-num 3965 non-null float64
8 relationship 3968 non-null object
9 skewness_glucose 3972 non-null float64
10 mean_glucose 3572 non-null float64
11 capital-gain 3972 non-null float64
12 kurtosis_glucose 3970 non-null float64
13 education 3968 non-null object
14 fnlwgt 3968 non-null float64
15 class 3969 non-null float64
16 std_glucose 3965 non-null float64
17 income 3974 non-null object
18 medical_info 3968 non-null object
19 native-country 3711 non-null object
20 hours-per-week 3971 non-null float64
21 capital-loss 3969 non-null float64
22 workclass 3968 non-null object
dtypes: float64(10), int64(1), object(12)
memory usage: 715.8+ KB
example entry from df:
0,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201", White, Married-civ-spouse, Exec-managerial,f,9.0, Husband,1.904881822,79.484375,15024.0,0.667177618, HS-grad,147707.0,0.0,39.49544760000001, >50K,"{'mean_oxygen':'1.501672241','std_oxygen':'13.33605383','kurtosis_oxygen':'11.36579476','skewness_oxygen':'156.77910559999995'}", United-States,60.0,0.0, Private
df_p.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3933 non-null int64
1 name 3933 non-null object
2 address 3933 non-null object
3 age 3933 non-null int64
4 sex 3933 non-null object
5 date_of_birth 3933 non-null object
dtypes: int64(2), object(4)
memory usage: 184.5+ KB
sample entry from df_p:
2273,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201",44, Male,1975-03-26
As you can see, the chosen samples are for the same person, but their index does not match, which is why I tried using the name and address columns.
Edit2: Changing the order of df and df_p in the merge seems to have solved the issue, though I have no clue why.
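When an inner merge on string keys comes back empty even though matching values look identical, a common culprit is invisible formatting differences (trailing spaces, inconsistent line breaks in multi-line addresses). A sketch of normalizing the keys before merging; the stray trailing/leading spaces in the toy data are assumptions illustrating the kind of mismatch, not the asker's actual files:

```python
import pandas as pd

# Toy stand-ins for df and df_p with a deliberate whitespace mismatch.
df = pd.DataFrame({'name': ['Curtis Brown'],
                   'address': ['32266 Byrd Island ']})
df_p = pd.DataFrame({'name': ['Curtis Brown '],
                     'address': ['32266 Byrd Island'],
                     'age': [44]})

# Strip surrounding whitespace from both key columns in both frames.
for frame in (df, df_p):
    for col in ('name', 'address'):
        frame[col] = frame[col].str.strip()

res = pd.merge(df, df_p, on=['name', 'address'], how='inner')
print(len(res))  # 1 match after normalization
```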

Why is the data displayed after a `merge` different from the actual dataframe in pandas and jupyter notebook?

I merge three dataframes, but the result displayed is different from the actual dataframe afterwards. I want to keep the result that is displayed.
Here is the merge code:
df_twitter_archive_clean_test.merge(df_tweets_clean_test, on='tweet_id', how='left')
df_twitter_archive_clean_test.merge(df_images_clean_test, on='tweet_id')
Here is part of the result that pops up after running this code:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
That result has 28 columns.
But when I run df_twitter_archive_clean_test.info() I get 17 columns!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id 2356 non-null object
in_reply_to_status_id 78 non-null float64
in_reply_to_user_id 78 non-null float64
timestamp 2356 non-null object
source 2356 non-null object
text 2356 non-null object
retweeted_status_id 181 non-null float64
retweeted_status_user_id 181 non-null float64
retweeted_status_timestamp 181 non-null object
expanded_urls 2297 non-null object
rating_numerator 2356 non-null int64
rating_denominator 2356 non-null int64
name 2356 non-null object
doggo 2356 non-null object
floofer 2356 non-null object
pupper 2356 non-null object
puppo 2356 non-null object
dtypes: float64(4), int64(2), object(11)
memory usage: 313.0+ KB
Testing the data reveals that the dataset has 17 columns.
How can I stop this mysterious change?
I assume you are not assigning the merge result to a variable, so info() is showing df_twitter_archive_clean_test as it was before the merge. Some pandas methods accept an inplace parameter that saves the changes back into the existing dataframe, but merge is not one of them: it never modifies the dataframe it is called on, and always returns a new dataframe, so you must assign the result to keep it.
If you wish to solve this you can try:
semi_merged_df = df_twitter_archive_clean_test.merge(df_tweets_clean_test, on='tweet_id', how='left')
merged_df = semi_merged_df.merge(df_images_clean_test, on='tweet_id')
And finally:
merged_df.info()
should show the merged result with all 28 columns. (Note that info() prints directly and returns None, so wrapping it in print() is unnecessary.)
Changing the merge code to this seems to get rid of the problem (note that the result is assigned to df_master, which is why it persists):
df_master = pd.merge(df_twitter_archive_clean_test, df_images_clean_test, how = 'left', on = ['tweet_id'] )
df_master = pd.merge(df_master, df_tweets_clean_test, how = 'left', on = ['tweet_id'])
I don't know why.

How to use pandas for comparisons only of a part of a variable?

I need to compare codes in two dataframes. I'm using Python 3 and pandas
In the first base the codes always have 18 digits:
dividas_dep = pd.read_csv("dividas_deputados_ajustado_csv.csv",sep=';',encoding = 'latin_1')
dividas_dep.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106 entries, 0 to 105
Data columns (total 10 columns):
CPF_Deputado 106 non-null object
CPF_limpo 106 non-null int64
Nome_Deputado 106 non-null object
Vinculo 106 non-null object
CNPJ_Devedor 106 non-null object
CNPJ_limpo 106 non-null int64
Nome_Devedor 106 non-null object
Valores_situacao_Irregular 65 non-null object
Valores_situacao_Regular 52 non-null object
Total_Devido 106 non-null object
dtypes: int64(2), object(8)
memory usage: 8.4+ KB
The column to compare in this first base ("CNPJ_Devedor") has these examples: 17.080.201/0001-49, 76.205.723/0001-99, 04.885.828/0001-25...
And in the second base, the codes always have 10 digits:
funrural = pd.read_excel('DEVEDORES FUNRURAL ATUALIZADO PGFN.xlsx')
funrural.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8130 entries, 0 to 8129
Data columns (total 14 columns):
PSFN_PGFN 8129 non-null object
Regiao 8129 non-null object
CNPJ_CEI_Tipo 8129 non-null object
CNPJ_Raiz 8129 non-null object
Razao_Social 8130 non-null object
Valor_principal 8130 non-null float64
Valor_TR_IPC_Poup 8130 non-null float64
Valor_Juros 8130 non-null float64
Valor_SELIC 8130 non-null float64
Valor_Encargo 8130 non-null float64
Valor_Multa_Oficio 8130 non-null float64
Valor_Selic_M_Oficio 8130 non-null float64
Vl_Multa_Mora 8130 non-null float64
Vl_Tot_Credito 8130 non-null float64
dtypes: float64(9), object(5)
memory usage: 889.3+ KB
The column to compare in this second base ("CNPJ_Raiz") has these examples: 04.244.173, 05.006.407, 03.632.132...
The codes "CNPJ_Devedor" and "CNPJ_Raiz" are related under tax legislation, but I cannot simply merge them like this:
compara1 = pd.merge(dividas_dep, funrural, left_on='CNPJ_Devedor', right_on='CNPJ_Raiz')
What I need to do is compare only the first 10 digits of "CNPJ_Devedor" with the code "CNPJ_Raiz" (example, in "17.080.201/0001-49" use only "17.080.201")
Is there any way to do this in Python? Or should I edit the original dataframe file, dividas_dep (dividas_deputados_ajustado_csv.csv), to create a new column with only the first 10 digits?
You can compare the first 10 characters of the strings with .str.slice(None, 10):
dividas_dep["CNPJ_Devedor"].str.slice(None, 10) == funrural["CNPJ_Raiz"]
Example:
>>> dividas_dep = pd.DataFrame({"CNPJ_Devedor": ['17.080.201/0001-49', '76.205.723/0001-99', '04.885.828/0001-25']})
>>> funrural = pd.DataFrame({"CNPJ_Raiz": ['17.080.201', '04.244.173', '05.006.407']})
>>> dividas_dep["CNPJ_Devedor"].str.slice(None, 10) == funrural["CNPJ_Raiz"]
0 True
1 False
2 False
dtype: bool
You can use the result to create a new dataframe:
res = dividas_dep["CNPJ_Devedor"].str.slice(None, 10) == funrural["CNPJ_Raiz"]
funrural[res]
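Note that the element-wise == comparison above only works when the two Series have aligned indexes of the same length, which is not the case here (106 rows versus 8130). A more robust variant of the same idea, sketched with made-up CNPJ values (the Razao_Social data is an assumption), is to derive the 10-character root as a new column and then merge on it:

```python
import pandas as pd

# Toy stand-ins for the two frames from the question.
dividas_dep = pd.DataFrame(
    {"CNPJ_Devedor": ['17.080.201/0001-49', '76.205.723/0001-99']})
funrural = pd.DataFrame({"CNPJ_Raiz": ['17.080.201', '04.244.173'],
                         "Razao_Social": ['Empresa A', 'Empresa B']})

# Derive the 10-character root from the full code, then merge on it; this
# works regardless of the frames' lengths or row order.
dividas_dep["CNPJ_Raiz"] = dividas_dep["CNPJ_Devedor"].str.slice(None, 10)
compara1 = pd.merge(dividas_dep, funrural, on="CNPJ_Raiz")
print(len(compara1))  # 1 matching root
```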

Merging dataframes in pandas - Keep getting key error?

I'm trying to merge two data frames, testr and testc, but I keep getting a Key Error on "Channel ID" and not sure what the problem is. Do the dataframes have to be the same size or have the same datatype for pd.merge to work? Here is my code to merge with .info() on each dataframe:
def matchID_RC(rev, cost):
    rc = pd.merge(rev, cost, on='Channel ID', how='outer')
    return rc
testr.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 169 entries, 0 to 168
Data columns (total 7 columns):
Channel ID 169 non-null int64
Channel Name 169 non-null object
Impressions 169 non-null object
Fill Rate 169 non-null object
Gross Rev 169 non-null object
Impression Fees 169 non-null object
Exchange Fees 169 non-null object
dtypes: int64(1), object(6)
memory usage: 10.6+ KB
testc.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 63 entries, 0 to 62
Data columns (total 3 columns):
Channel ID 62 non-null object
Campaign 63 non-null object
Ad Spend 63 non-null float64
dtypes: float64(1), object(2)
memory usage: 2.0+ KB
The join keys need to be the same data type. After all, pandas will not match the integer 2 with the string '2'.
It seems that 'Channel ID' does exist in the two dataframes, however, in one it is defined as object and in the other as int.
This can be fixed with convert_objects:
def matchID_RC(rev, cost, col='Channel ID'):
    rev[col] = rev[col].convert_objects(convert_numeric=True)
    cost[col] = cost[col].convert_objects(convert_numeric=True)
    rc = pd.merge(rev, cost, on=col, how='outer')
    return rc
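Be aware that convert_objects was deprecated in pandas 0.17 and later removed. On current pandas, the same fix can be sketched with pd.to_numeric (the toy testr/testc data below is an assumption for illustration):

```python
import pandas as pd

def matchID_RC(rev, cost, col='Channel ID'):
    # pd.to_numeric is the modern replacement for convert_objects;
    # errors='coerce' turns unparseable IDs into NaN instead of raising.
    rev[col] = pd.to_numeric(rev[col], errors='coerce')
    cost[col] = pd.to_numeric(cost[col], errors='coerce')
    return pd.merge(rev, cost, on=col, how='outer')

# Minimal check: int IDs on one side, string IDs on the other.
testr = pd.DataFrame({'Channel ID': [1, 2], 'Gross Rev': ['10', '20']})
testc = pd.DataFrame({'Channel ID': ['2', '3'], 'Ad Spend': [5.0, 6.0]})
rc = matchID_RC(testr, testc)
print(len(rc))  # 3: ids 1, 2, 3 after the outer join
```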
You could also change on to left_on and right_on, which names the key in each dataframe explicitly. As the official pandas documentation says about on:
Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this
defaults to the intersection of the columns in both DataFrames.
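A short sketch of the left_on/right_on form, useful when the key has a different name on each side (the frames and the 'channel' column name here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames where the key column is named differently on each
# side; left_on/right_on make the intended keys explicit, and a wrong
# name raises a KeyError immediately at the merge call.
rev = pd.DataFrame({'Channel ID': [1, 2], 'Gross Rev': [10, 20]})
cost = pd.DataFrame({'channel': [2, 3], 'Ad Spend': [5.0, 6.0]})

rc = pd.merge(rev, cost, left_on='Channel ID', right_on='channel', how='outer')
print(len(rc))  # 3: union of the keys 1, 2, 3
```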

Having issues reading a .csv file python-pandas

I'm trying to read this .txt file in pandas and this is my result. I thought (naively) that I was getting the hang of this stuff last night, but apparently I'm wrong. If I simply run
rebull = pd.read_table('rebull.txt',sep=' ')
it works, but the result contains a disordered array of NaNs, which I assume come from the uneven spacing in the initial .txt file.
Try skipinitialspace:
In [26]: pd.read_table('test.txt', sep=' ', skipinitialspace=True)
Out[26]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386 entries, 0 to 385
Data columns (total 7 columns):
Mon 386 non-null values
id 386 non-null values
NA 386 non-null values
alpha_K24 386 non-null values
class 386 non-null values
alpha_K8 386 non-null values
class.1 0 non-null values
dtypes: float64(3), object(4)
EDIT
Sorry for misunderstanding your problem. I think you can read the table as @DSM mentioned and also set the column names:
In [55]: pd.read_table('test.txt', sep=r"\s\s+", header=None, skiprows=[0], names=['Mon id', 'Na', 'alpha_K24', 'class', 'alpha_8', 'class'])
Out[55]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386 entries, 0 to 385
Data columns (total 6 columns):
Mon id 386 non-null values
Na 386 non-null values
alpha_K24 386 non-null values
class 386 non-null values
alpha_8 386 non-null values
class 386 non-null values
dtypes: float64(2), object(4)
Note that you might want to give the second class column a different name; otherwise df['class'] will return both columns at once.
Figured out my problem: always confirm that multi-word column names are joined by hyphens where necessary. In particular, 'Mon id' in the first column was my problem; it should be 'Mon-id'.
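Building on that fix, a simpler approach on current pandas (a sketch with a made-up miniature of the file) is the regex separator sep=r"\s+", which collapses any run of spaces into a single delimiter, so skipinitialspace is not needed and hyphenated names like 'Mon-id' stay intact as one column:

```python
import pandas as pd
from io import StringIO

# Assumed miniature of the whitespace-padded file from the question.
data = StringIO("Mon-id  Na   alpha_K24\n"
                "M1      0.1  1.5\n"
                "M2      0.2  2.5\n")

# sep=r"\s+" treats any run of whitespace as a single delimiter.
df = pd.read_csv(data, sep=r"\s+")
print(df.shape)  # (2, 3)
```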
