I have two dataframes, both with a column 'hotelCode' of type string; I made sure to convert both columns to string beforehand.
The first dataframe, which we'll call old_DF, looks like this:
and the second dataframe, new_DF, looks like this:
I have been trying to merge these unsuccessfully. I've tried
final_DF = new_DF.join(old_DF, on = 'hotelCode')
and get this error:
I've tried a variety of things (changing the index name, various merge/join/concat calls) and just haven't been successful.
Ideally, I will end up with a new dataframe that has columns [hotelCode, oldDate, newDate] under one roof.
import pandas as pd
final_DF = pd.merge(old_DF, new_DF, on='hotelCode', how='outer')
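As a quick illustration with made-up data (the real frames aren't shown above, so the hotelCode values and dates here are assumptions):

import pandas as pd

# Hypothetical data for illustration only
old_DF = pd.DataFrame({'hotelCode': ['A1', 'B2'], 'oldDate': ['2020-01-01', '2020-02-01']})
new_DF = pd.DataFrame({'hotelCode': ['A1', 'C3'], 'newDate': ['2021-01-01', '2021-03-01']})

# An outer merge keeps every hotelCode from both frames; missing dates become NaN
final_DF = pd.merge(old_DF, new_DF, on='hotelCode', how='outer')
print(final_DF)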
I'm trying to pivot my df from wide to long, attempting to replicate R's tidyr::pivot_longer() function. I have tried pd.wide_to_long() and pd.melt() but have had no success in correctly formatting the df. I also attempted df.pivot() and came to the same conclusion.
Here is what a subset of the df (called df_wide) looks like: Rows are Store Numbers, Columns are Dates, Values are Total Sales
My current function looks like this:
df_wide.pivot(index = df_wide.index,
columns = ["Store", "Date", "Value"], # Output Col Names
values = df_wide.values)
My desired output is a df that looks like this:
Note: this question is distinct from merging, as it is about changing the structure of a single dataframe.
The stack() function is useful to achieve your objective; then reformat as needed:
pd.DataFrame(df_wide.stack()).reset_index(drop=False).rename(columns={'level_0': 'Store', 'level_1': 'Date', 0: 'Value'})
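As a minimal sketch (the store numbers, dates, and sales below are made up for illustration):

import pandas as pd

# Hypothetical wide frame: rows are store numbers, columns are dates
df_wide = pd.DataFrame({'2021-01-01': [100, 200], '2021-01-02': [150, 250]}, index=[1, 2])

# stack() moves the date columns into the index, giving one (Store, Date) row per value
df_long = (pd.DataFrame(df_wide.stack())
           .reset_index(drop=False)
           .rename(columns={'level_0': 'Store', 'level_1': 'Date', 0: 'Value'}))
print(df_long)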
I have a Pandas dataframe that has duplicate names with different values, and I want to remove the duplicate names but keep the rows. A snippet of my dataframe looks like this:
And my desired output would look like this:
I've tried using the built-in pandas function .drop_duplicates(), but I end up deleting all duplicates and their respective rows. My current code looks like this:
df = pd.read_csv("merged_db.csv", encoding = "unicode_escape", chunksize=50000)
df = pd.concat(df, ignore_index=True)
df2 = df.drop_duplicates(subset=['auth_given_name', 'auth_surname'])
and this is output I am currently getting:
Basically, I want to keep all the coauthor values but remove the duplicated data of the original author. My question is what is the best way to achieve the output I want. I tried using the subset parameter, but I don't believe I'm using it correctly. I also found a similar post, but I couldn't really apply it to Python. Thank you for your time!
You may consider this code:
import numpy as np
import pandas as pd

# chunksize yields an iterator of chunks, so concatenate them into one frame first
df = pd.concat(pd.read_csv("merged_db.csv", encoding="unicode_escape", chunksize=50000), ignore_index=True)
first_author = df.columns[:24]
# On duplicate rows, blank out the first-author columns while keeping the rest of the row
df.loc[df.duplicated(first_author), first_author] = np.nan
print(df)
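To see what the duplicated/loc step does, here is a toy sketch (the names and the 'coauthor' values are made up):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'auth_given_name': ['Ann', 'Ann', 'Bob'],
                    'auth_surname': ['Lee', 'Lee', 'Ray'],
                    'coauthor': ['X', 'Y', 'Z']})
cols = ['auth_given_name', 'auth_surname']
# The second Ann/Lee row keeps its coauthor but loses the duplicated name fields
toy.loc[toy.duplicated(cols), cols] = np.nan
print(toy)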
After I created a dataframe and applied get_dummies to it:
df_final=pd.get_dummies(df,columns=['type'])
I got the new columns that I wanted and everything works.
My question is: how can I get the names of the new columns created by get_dummies? My dataframe is dynamic, so I can't reference them statically; I want to save all the new column names in a list.
An option would be:
df_dummy = pd.get_dummies(df, columns=target_cols)
df_dummy.columns.difference(df.columns).tolist()
where df is your original dataframe, df_dummy is the output from pd.get_dummies, and target_cols is your list of columns to dummify.
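A quick sketch of how that plays out, with made-up column names:

import pandas as pd

# Hypothetical frame; 'type' is the column being dummified
df = pd.DataFrame({'type': ['a', 'b', 'a'], 'price': [1, 2, 3]})
target_cols = ['type']
df_dummy = pd.get_dummies(df, columns=target_cols)
# Columns in df_dummy that were not in df are exactly the dummy columns
print(df_dummy.columns.difference(df.columns).tolist())  # ['type_a', 'type_b']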
I have three separate DataFrames. Each DataFrame has the same columns: ['Email', 'Rating']. There are duplicate row values across all three DataFrames for the column Email. I'm trying to find the emails that appear in all three DataFrames and then create a new DataFrame from those rows. So far I have saved all three DataFrames to a list, dfs = [df1, df2, df3], and then concatenated them using df = pd.concat(dfs). I tried using groupby from here but to no avail. Any help would be greatly appreciated.
You want to do a merge. Similar to a join in SQL, you can do an inner merge and treat the email like a foreign key. Here are the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
It would look something like this:
in_common = pd.merge(df1, df2, on=['Email'], how='inner')
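Since you have three DataFrames, you can chain a second inner merge; here is a sketch with made-up emails and ratings (the suffixes keep the three Rating columns apart):

import pandas as pd

# Hypothetical data; emails surviving both inner merges appear in all three frames
df1 = pd.DataFrame({'Email': ['a@x.com', 'b@x.com'], 'Rating': [1, 2]})
df2 = pd.DataFrame({'Email': ['a@x.com', 'c@x.com'], 'Rating': [3, 4]})
df3 = pd.DataFrame({'Email': ['a@x.com'], 'Rating': [5]})

in_common = pd.merge(df1, df2, on=['Email'], how='inner', suffixes=('_1', '_2'))
in_common = pd.merge(in_common, df3, on=['Email'], how='inner')
print(in_common)  # only a@x.com, with all three ratings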
You could try using .isin from pandas, e.g.:
df1[df1['Email'].isin(df2['Email'])]
This retrieves the rows of df1 whose Email value also appears in df2.
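For three frames you could chain the condition; a rough sketch, assuming the df1/df2/df3 names from the question:

# Keep rows of df1 whose Email appears in both df2 and df3
common = df1[df1['Email'].isin(df2['Email']) & df1['Email'].isin(df3['Email'])]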
Another idea is to try an inner merge.
Good luck, and post code next time.
Let's say I've pulled csv data from two separate files containing a date index that pandas automatically parsed from one of the original columns.
import pandas as pd
df1 = pd.read_csv(data1, parse_dates=True, infer_datetime_format=True, index_col=0, names=['A'])
df2 = pd.read_csv(data2, parse_dates=True, infer_datetime_format=True, index_col=0, names=['A'])
Now the dates in one csv file differ from those in the other, but when loaded with read_csv, the dates are well defined. I've tried the join command, but it doesn't seem to preserve the dates.
df1 = df1.join(df2)
I get a valid dataframe, but the date range is restricted to some smaller subset of what the original range should be, given the disparity between the dates in the two csv files. What I would like is a single dataframe with two columns (both 'A' columns) where NaN or zero values are filled in automatically for the non-overlapping dates. Is there a simple solution for this, or is there something I might be missing here? Thanks so much.
By default, the pandas DataFrame method 'join' performs a 'left' join, which keeps only the dates in df1's index. You want an 'outer' join, and because both frames have a column named 'A', you also need suffixes to disambiguate them. Your join line should read:
df1 = df1.join(df2, how='outer', lsuffix='_1', rsuffix='_2')
See http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.join.html
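For example, with made-up dates and values:

import pandas as pd

# Hypothetical overlapping date ranges for illustration
df1 = pd.DataFrame({'A': [1, 2, 3]}, index=pd.date_range('2021-01-01', periods=3))
df2 = pd.DataFrame({'A': [4, 5, 6]}, index=pd.date_range('2021-01-02', periods=3))

# The outer join keeps the union of both date indexes; non-overlapping dates get NaN
joined = df1.join(df2, how='outer', lsuffix='_1', rsuffix='_2')
print(joined)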