appending in pandas - row wise - python

I'm trying to append two columns of my dataframe to an existing dataframe with this:
dataframe.append(df2, ignore_index = True)
and this does not seem to be working.
This is what I'm looking for (kind of) --> a dataframe with 2 columns and 6 rows:
although this is not correct and it's using two print statements to print the two dataframes, I thought it might be helpful to have a selection of the data in mind.
I tried to use concat(), but that leads to some issues as well.
dataframe = pd.concat([dataframe, df2])
but that appears to concatenate the second dataframe as columns rather than rows, in addition to giving NaN values:
Any ideas on what I should do?

I assume this happened because your dataframes have different column names. Try assigning the second dataframe column names with the first dataframe column names.
df2.columns = dataframe.columns
dataframe_new = pd.concat([dataframe, df2], ignore_index=True)
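To see why renaming helps, here is a minimal sketch with made-up data (the column names a, b, c, d and the values are assumptions, not taken from the question). pd.concat aligns on column labels, so frames with different names end up side by side with NaN in the gaps.

```python
import pandas as pd

# Hypothetical data: two frames with the same shape but different column names.
dataframe = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"c": [7, 8, 9], "d": [10, 11, 12]})

# Mismatched names: concat aligns on column labels, so the result has
# four columns and the missing cells are filled with NaN.
misaligned = pd.concat([dataframe, df2], ignore_index=True)
print(misaligned.shape)  # (6, 4)

# Matching names: the rows of df2 are stacked under the rows of dataframe.
df2.columns = dataframe.columns
stacked = pd.concat([dataframe, df2], ignore_index=True)
print(stacked.shape)  # (6, 2)
```

As for the original attempt: DataFrame.append returned a new dataframe instead of modifying the existing one, so its result had to be assigned, and the method was removed in pandas 2.0, which makes pd.concat the safer choice.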

Related

Join column in dataframe to another dataframe - Pandas

I have 2 dataframes. One has a bunch of columns including f_uuid. The other dataframe has 2 columns, f_uuid and i_uuid.
the first dataframe may contain some f_uuids that the second dataframe doesn't and vice versa.
I want the first dataframe to have a new column i_uuid (from the second dataframe) populated with the appropriate values for the matching f_uuid in that first dataframe.
How would I achieve this?
df1 = pd.merge(df1,
               df2,
               on='f_uuid')
The default how='inner' keeps only the f_uuid values present in both dataframes. If you want to keep all f_uuid from df1 (including those with no match in df2), you may run
df1 = pd.merge(df1,
               df2,
               on='f_uuid',
               how='left')
I think what you're looking for is a merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge
In your case, that would look like:
bunch_of_col_df.merge(other_df, on="f_uuid")
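The difference between the two calls can be sketched on a small made-up example (the value column and all the uuid strings are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: f2 exists only in df1, f4 exists only in df2.
df1 = pd.DataFrame({"f_uuid": ["f1", "f2", "f3"], "value": [10, 20, 30]})
df2 = pd.DataFrame({"f_uuid": ["f1", "f3", "f4"], "i_uuid": ["i1", "i3", "i4"]})

# Inner join (the default): only f_uuid values found in both frames survive.
inner = pd.merge(df1, df2, on="f_uuid")
print(len(inner))  # 2

# Left join: every row of df1 is kept; i_uuid is NaN where df2 has no match.
left = pd.merge(df1, df2, on="f_uuid", how="left")
print(len(left))  # 3
```

Note that if df2 contained the same f_uuid more than once, the matching rows of df1 would be repeated in the result.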

Use multiple rows as column header for pandas

I have a dataframe that I've imported as follows.
df = pd.read_excel("./Data.xlsx", sheet_name="Customer Care", header=None)
I would like to set the first three rows as column headers but can't figure out how to do this. I gave the following a try:
df.columns = df.iloc[0:3,:]
but that doesn't seem to work.
I saw something similar in this answer. But it only applies if all sub columns are going to be named the same way, which is not necessarily the case.
Any recommendations would be appreciated.
df = pd.read_excel(
    "./Data.xlsx",
    sheet_name="Customer Care",
    header=[0, 1, 2]
)
This will tell pandas to read the first three rows of the Excel file as MultiIndex column labels.
If you have already loaded the data with header=None and want to modify the rows first, you can set them as columns afterwards:
# set the first three rows as column labels
df.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values)
# delete the first three rows (because they are now the column labels)
df = df.iloc[3:]
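Since the Excel file itself is not available, here is a rough sketch of the second approach on a small in-memory frame. The cell contents below are invented, and raw stands in for what read_excel would return with header=None.

```python
import pandas as pd

# Hypothetical sheet contents, as if read with header=None:
# three header rows followed by two data rows.
raw = pd.DataFrame([
    ["Region", "Region", "Sales"],
    ["Name", "Code", "2020"],
    ["text", "text", "USD"],
    ["North", "N", 100],
    ["South", "S", 200],
])

# Each of the first three rows becomes one level of the column MultiIndex.
header = pd.MultiIndex.from_arrays(raw.iloc[0:3].values)

# Keep only the data rows and renumber them from 0.
data = raw.iloc[3:].reset_index(drop=True)
data.columns = header

print(data.columns.tolist())
```

Columns built this way have object dtype because they originally held both header text and data, so numeric columns may need converting afterwards (for example with pd.to_numeric).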

Adding a new column to a pandas dataframe

I have a dataframe df with one column and 500k rows (the first 5 elements are given below). I want to add new data to the existing column. The new data is a matrix of 200k rows and 1 column. How can I do it? I also want to add a new column named op.
X098_DE_time
0.046104
-0.037134
-0.089496
-0.084906
-0.038594
We can use the concat function after renaming the column of the second dataframe (here called df2, with its column assumed to be named op) so that it matches the first.
df2.rename(columns={'op': 'X098_DE_time'}, inplace=True)
new_df = pd.concat([df, df2], axis=0, ignore_index=True)
Note: If we don't rename the df2 column, the resulting new_df will have 2 different columns.
To add a new column you can use
df["op"] = values  # values: a list or array with one entry per row of df
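Putting both steps together, a minimal sketch might look like the following. The array new_data and the placeholder value written into op are assumptions; the question does not say what op should contain.

```python
import numpy as np
import pandas as pd

# First rows of the existing single-column frame from the question.
df = pd.DataFrame({"X098_DE_time": [0.046104, -0.037134, -0.089496]})

# Hypothetical new data: a matrix with shape (n_rows, 1).
new_data = np.array([[0.5], [0.6]])

# Wrap the matrix in a frame that reuses the existing column name,
# then stack it under the existing rows.
df2 = pd.DataFrame(new_data, columns=df.columns)
combined = pd.concat([df, df2], ignore_index=True)

# Add the new column; a scalar is broadcast to every row.
combined["op"] = 0

print(combined.shape)  # (5, 2)
```

Building df2 with columns=df.columns avoids the rename step entirely, because the names match from the start.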

How can I drop a column from multiple dataframes stored in a list?

Apologies as I'm new to all this.
I'm playing around with pandas at the moment. I want to drop one particular column across two dataframes stored within a list. This is what I've written.
combine = [train, test]
for dataset in combine:
    dataset = dataset.drop('Id', axis=1)
However, this doesn't work. If I do this explicitly, such as train = train.drop('Id', axis=1), this works fine.
I appreciate in this case it's two lines either way, but is there some way I can use the list of dataframes to drop the column from both?
Your solution didn't work because dataset is just a name bound to each item of the list combine in turn. You had the right idea to reassign it with dataset = dataset.drop('Id', axis=1), but that only binds the name dataset to a new dataframe; it does not replace the dataframe stored in the list combine.
Option 1
Create a new list
combine = [d.drop('Id', axis=1) for d in combine]
Option 2
Or alter each dataframe in place with inplace=True
for d in combine:
    d.drop('Id', axis=1, inplace=True)
Or assign the result back into the list by index:
combine = [df1, df2]
for i in range(len(combine)):
    combine[i] = combine[i].drop('Id', axis=1)
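The rebinding behaviour described above can be checked with a short sketch (the two tiny frames are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for the train and test frames.
train = pd.DataFrame({"Id": [1, 2], "x": [0.1, 0.2]})
test = pd.DataFrame({"Id": [3, 4], "x": [0.3, 0.4]})
combine = [train, test]

# Rebinding the loop variable does not touch the list.
for dataset in combine:
    dataset = dataset.drop("Id", axis=1)
print("Id" in combine[0].columns)  # True

# Building a new list replaces the items in combine...
combine = [d.drop("Id", axis=1) for d in combine]
print("Id" in combine[0].columns)  # False

# ...but the names train and test still refer to the original frames,
# so unpack the list if those names should be updated too.
print("Id" in train.columns)  # True
train, test = combine
print("Id" in train.columns)  # False
```

The inplace=True variant differs here: it modifies the original objects, so train and test change without any reassignment.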

Concatenating Two Pandas DataFrames with the same length extends the length of the resulting DataFrame

I have two DataFrames that I am trying to concat together. df_output_norm is the dataframe I am trying to get. X_test_minmax has 81732 rows and 6 columns, y_test has 81732 rows and 1 column. This should be an easy concatenation, but when I concatenate it, the resulting size is (147158, 7)
df_output_norm = pd.DataFrame()
df_output_norm = pd.concat([pd.DataFrame(X_test_minmax), pd.DataFrame(y_test)], axis=1)
print(df_output_norm.shape)
print(df_output_norm['label'].shape)
print(X_test_minmax.shape)
print(y_test.shape)
The output is
(147158, 7)
(147158,)
(81732, 6)
(81732,)
The number of columns is correct, just that the number of rows in the last column is wrong. I've looked at the data and only the last column 'label' is extended, which is the column y_test. The first 6 columns that come from X_test_minmax are of the correct row length. Why is this happening?
Pretty old question, but I landed here looking for a solution to the same problem. I figured out that it's because of a mismatch in the row indices: with axis=1, concat aligns rows on their index labels and keeps the union of them, so labels present in only one dataframe produce extra rows filled with NaN (this is very likely if your dataframes were taken from a larger one by sampling or splitting).
Try
X_test_minmax.reset_index(inplace=True,drop=True)
y_test.reset_index(inplace=True,drop=True)
df_output_norm = pd.concat([pd.DataFrame(X_test_minmax), pd.DataFrame(y_test)], axis=1)
If ignore_index=True does not help, duplicated column names may be the cause: a column name in the first dataframe may be the same as one in the second dataframe. Renaming the column might help.
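Perhaps the index is preventing the result you are seeking. Try
df_output_norm = pd.concat([pd.DataFrame(X_test_minmax), pd.DataFrame(y_test)],
                           axis=1,
                           ignore_index=True)
to ignore indexes on the concatenation axis. Note that with axis=1 the concatenation axis is the columns, so this relabels the columns 0, 1, 2, ... while rows are still aligned on their row index; if the row indices differ, they still need to be reset first.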
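The row growth can be reproduced on a small scale. This sketch assumes that X_test_minmax is a NumPy array (for example, the output of a scaler) and that y_test is a Series which kept its shuffled index from an earlier split; neither is confirmed by the question, and the values are invented.

```python
import numpy as np
import pandas as pd

# Assumed setup: a plain array (row labels 0..3 once wrapped in a DataFrame)
# and a Series that kept non-consecutive index labels.
X_test_minmax = np.arange(8).reshape(4, 2)
y_test = pd.Series([1, 0, 1, 0], index=[10, 2, 7, 3], name="label")

# Rows are aligned on index labels; the union {0, 1, 2, 3, 7, 10} has 6 labels.
bad = pd.concat([pd.DataFrame(X_test_minmax), pd.DataFrame(y_test)], axis=1)
print(bad.shape)  # (6, 3)

# Resetting the Series index makes both sides use labels 0..3.
good = pd.concat(
    [pd.DataFrame(X_test_minmax), y_test.reset_index(drop=True)],
    axis=1,
)
print(good.shape)  # (4, 3)
```

If X_test_minmax really is a NumPy array, only y_test needs reset_index, since arrays have no index of their own.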
