I am trying to merge two dataframes, but when I do it, it seems to replace the existing values with blanks, which isn't ideal.
df1
col1 col2
'house1' 300
'house2' 450
'house3' 750
df2
col1 col2
'house4' 600
the code I'm using to concat is:
df = pd.concat([df2, df1], sort=False, ignore_index=True)
which yields:
col1 col2
'house4' 600
'house1'
'house2'
'house3'
Is concat the wrong choice? I want to keep all the values, but as I said, it's overwriting the existing data. I'm sorry if I'm doing a terrible job of explaining this.
You should try append:
df = df1.append(df2, ignore_index=True)
Since you want to add a row, you'll probably want to use the pandas append function.
Try this:
df = df1.append(df2)
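A caveat on both answers: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the pd.concat call from the question is the right tool. The blanks usually mean the two frames' column names don't match exactly, e.g. a stray space in a header, so pandas aligns them as different columns. A minimal sketch, assuming whitespace in a header is the culprit (the data is reconstructed from the example above):

import pandas as pd

# 'col2 ' with a trailing space is an assumption, added to reproduce the blanks
df1 = pd.DataFrame({'col1': ['house1', 'house2', 'house3'], 'col2 ': [300, 450, 750]})
df2 = pd.DataFrame({'col1': ['house4'], 'col2': [600]})

# normalize the headers so concat lines up col2 from both frames
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()

df = pd.concat([df2, df1], ignore_index=True)
print(df)
#      col1  col2
# 0  house4   600
# 1  house1   300
# 2  house2   450
# 3  house3   750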
df1 for example is
   col1     col2     col3
abcdef   ghijkl   mnopqr
abcdef1  ghijkl1  mnopqr1
df2 is
col1
ghijkl1
essentially I want to delete rows from df1 where the col2 value doesn't appear in df2's col1
the final df1 would be:
   col1     col2     col3
abcdef1  ghijkl1  mnopqr1
Make sure you don't have whitespace at the start or end of the values inside your DataFrame or in the column names, as others pointed out above. I will show how to handle that in this answer.
# stripping whitespace from column names in both DataFrames
df1.rename(columns=lambda x: x.strip(), inplace=True)
df2.rename(columns=lambda x: x.strip(), inplace=True)
# stripping whitespace from values in both DataFrames
df1 = df1.apply(lambda x: x.str.strip())
df2 = df2.apply(lambda x: x.str.strip())
# dropping rows from df1 where the col2 value doesn't appear in df2's col1
mask = df1["col2"].isin(df2["col1"])  # boolean Series: True where the col2 value appears in df2's col1
new_df = df1[mask]  # keeps only the matching rows
Hope this helps.
indexes = df1[~df1.col2.isin(df2.col1)].index
df1 = df1.drop(indexes)
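Putting the isin approach together with the question's data (a runnable sketch; the frames are reconstructed from the example above):

import pandas as pd

# reconstructed from the example tables in the question
df1 = pd.DataFrame({'col1': ['abcdef', 'abcdef1'],
                    'col2': ['ghijkl', 'ghijkl1'],
                    'col3': ['mnopqr', 'mnopqr1']})
df2 = pd.DataFrame({'col1': ['ghijkl1']})

new_df = df1[df1['col2'].isin(df2['col1'])]
print(new_df)
#       col1     col2     col3
# 1  abcdef1  ghijkl1  mnopqr1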
I have a dataset, df, with some empty values in the second column, col2.
So I create a new table with the same column names, whose length equals the number of missing values in df's col2. I call the new dataframe df2.
df[df['col2'].isna()] = df2
But this returns NaN for the entire rows where col2 was missing, which means the rows selected by df[df['col2'].isna()] are now missing everywhere, not only in col2.
Why is that, and how can I fix it?
The all-NaN rows appear because assigning a DataFrame aligns on both the index and the column labels, and pandas fills NaN wherever the labels don't match. Assuming that by df2 you really meant a Series, so renaming it as s:
df.loc[df['col2'].isna(), 'col2'] = s.values
Example
nan = float('nan')
df = pd.DataFrame({'col1': [1,2,3], 'col2': [nan, 0, nan]})
s = pd.Series([10, 11])
df.loc[df['col2'].isna(), 'col2'] = s.values
>>> df
   col1  col2
0     1  10.0
1     2   0.0
2     3  11.0
Note
I don't like this, because it relies on knowing that the number of NaNs in df equals the length of s. It would be better to know how you create the missing values; with that information, we could probably propose a better and more robust solution.
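One way to remove that positional coupling, assuming you can build the replacement values against the index of the NaN rows (a sketch on the same toy data):

import pandas as pd

nan = float('nan')
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [nan, 0, nan]})

# give the replacement values the index of the NaN rows,
# so the assignment aligns on labels instead of position
nan_idx = df.index[df['col2'].isna()]
s = pd.Series([10, 11], index=nan_idx)
df.loc[nan_idx, 'col2'] = s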
Suppose I have two data frames:
a = {'col1': ['value_1'], 'col2': ['value_4']}
df_a = pd.DataFrame(data=a)
b = {'col3': [['value_1', 'value_2']], 'col4': ['value_6']}
df_b = pd.DataFrame(data=b)
I want to then merge the two data frames on columns col1 and col3, if the value in col1 is in the list for col3.
The expected result is
>>> df_merged
col1 col2 col3 col4
0 value_1 value_4 ['value_1', 'value_2'] 'value_6'
I am able to deconstruct the list, by getting the list by value:
ids = df_b.iloc[0]['col3']
and then I can iterate over the list, insert the list values into new columns in df_b, and continue with multiple merges, but this is ugly and seems very arbitrary.
Thus, I am looking for a clean and "pythonic" (read as elegant and generalized) way of doing the merge.
I would end up using the unnesting method to flatten your df_b, then do the merge:
s = unnesting(df_b, ['col3']).reset_index()
newdf = df_a.merge(s[['col3', 'index']], left_on='col1', right_on='col3', how='left').drop(columns='col3')
newdf.merge(df_b, left_on='index', right_index=True, how='left')
col1 col2 index col3 col4
0 value_1 value_4 0 [value_1, value_2] value_6
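The unnesting helper isn't defined in the answer; a common minimal version of the idiom (my reconstruction, assuming list-valued columns) is below. On pandas >= 0.25, DataFrame.explode does the same flattening.

import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each row's label once per element of its list column
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten every listed column into a plain column
    flat = pd.concat(
        [pd.DataFrame({col: np.concatenate(df[col].values)}) for col in explode],
        axis=1)
    flat.index = idx
    return flat.join(df.drop(columns=explode), how='left')

# pandas >= 0.25 equivalent:
# s = df_b.explode('col3').reset_index()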
Note: See EDIT below.
I need to keep a log of all rows dropped from my df, but I'm not sure how to capture them. The log should be a data frame that I can update for each .drop or .drop_duplicates operation. Here are 3 examples of the code for which I want to log dropped rows:
df_jobs_by_user = df.drop_duplicates(subset=['owner', 'job_number'], keep='first')
df.drop(df.index[indexes], inplace=True)
df = df.drop(df[df.submission_time.dt.strftime('%Y') != '2018'].index)
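For the two plain .drop calls, one general pattern (a sketch, not code from the thread; dropped_rows is a name introduced here) is to select the rows first, log them, then drop:

# df and indexes as in the examples above; dropped_rows is a hypothetical log frame
dropped_rows = df.loc[df.index[indexes]]               # rows the second example drops
df.drop(df.index[indexes], inplace=True)

mask = df.submission_time.dt.strftime('%Y') != '2018'  # rows the third example drops
dropped_rows = pd.concat([dropped_rows, df[mask]])
df = df[~mask]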
I found this solution to a different .drop case that uses pd.isnull to recode a pd.dropna statement and so allows a log to be generated prior to actually dropping the rows:
df.dropna(subset=['col2', 'col3']).equals(df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)])
But in trying to adapt it to pd.drop_duplicates, I find there is no pd.isduplicate parallel to pd.isnull, so this may not be the best way to achieve the results I need.
EDIT
I rewrote my question here to be more precise about the result I want.
I start with a df that has one dupe row:
import pandas as pd
import numpy as np
df = pd.DataFrame([['whatever', 'dupe row', 'x'], ['idx 1', 'uniq row', np.nan], ['sth diff', 'dupe row', 'x']], columns=['col1', 'col2', 'col3'])
print(df)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
2  sth diff  dupe row    x
I then implement the solution from jjp:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df.append(df.loc[mask])
I print the results:
print(df_keep)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
df_keep is what I expect and want.
print(df_droplog)
# Output:
       col1      col2 col3
0  whatever  dupe row    x
1     idx 1  uniq row  NaN
2  sth diff  dupe row    x
2  sth diff  dupe row    x
df_droplog is not what I want. It includes the rows from index 0 and index 1 which were not dropped and which I therefore do not want in my drop log. It also includes the row from index 2 twice. I want it only once.
What I want:
print(df_droplog)
# Output:
       col1      col2 col3
2  sth diff  dupe row    x
There is a parallel: pd.DataFrame.duplicated returns a Boolean series. You can use it as follows:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['owner', 'job_number'], keep='first')
df_jobs_by_user = df.loc[~mask]
df_droplog = df_droplog.append(df.loc[mask])
Since you only want the duplicated rows in df_droplog, append only those to an empty dataframe. What you were doing was appending them to the original dataframe df. Try this:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df_droplog.append(df.loc[mask])
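On pandas 2.0 and later, DataFrame.append no longer exists; the same accumulation can be written with pd.concat:

df_droplog = pd.concat([df_droplog, df.loc[mask]])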
What's the most pythonic way to drop the columns in a dataframe where the header row is NaN? Preferably inplace.
There may or may not be data in the column.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan], 'col2': [4, 5, 6], np.nan: [7, np.nan, 9]})
df.dropna(axis='columns', inplace=True)
Doesn't do it as it looks at the data in the column.
Wanted output
df = pd.DataFrame({'col1': [1, 2, np.nan], 'col2': [4, 5, 6]})
Thanks in advance for the replies.
Simply try this:
df.drop(np.nan, axis=1, inplace=True)
However, if 'no headers' includes None, then jpp's answer will work perfectly in one shot. Also, if you have more than one np.nan header, I don't know how to make df.drop work.
You can use pd.Index.dropna:
df = df[df.columns.dropna()]
print(df)
   col1  col2
0   1.0     4
1   2.0     5
2   NaN     6
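An equivalent boolean-mask variant, which also copes with several NaN headers (a sketch using pd.Index.notna):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan], 'col2': [4, 5, 6], np.nan: [7, np.nan, 9]})
df = df.loc[:, df.columns.notna()]  # keep only columns whose header is not NaN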