There are two DataFrames that I want to merge:
DataFrame A columns: index, userid, locale (2000 rows)
DataFrame B columns: index, userid, age (300 rows)
When I perform the following:
pd.merge(A, B, on='userid', how='outer')
I got a DataFrame with the following columns:
index, Unnamed:0, userid, locale, age
The index column and the Unnamed:0 column are identical. I guess the Unnamed:0 column is the index column of DataFrame B.
My question is: is there a way to avoid this Unnamed column when merging two DFs?
I can drop the Unnamed column afterwards, but just wondering if there is a better way to do it.
In summary, what you're doing is saving the index to file and when you're reading back from the file, the column previously saved as index is loaded as a regular column.
There are a few ways to deal with this:
Method 1
When saving a pandas.DataFrame to disk, use index=False like this:
df.to_csv(path, index=False)
Method 2
When reading from file, you can define the column that is to be used as index, like this:
df = pd.read_csv(path, index_col='index')
Method 3
If method #2 does not suit you for some reason, you can always set the column to be used as index later on, like this:
df.set_index('index', inplace=True)
After this point, your datafame should look like this:
userid locale age
index
0 A1092 EN-US 31
1 B9032 SV-SE 23
I hope this helps.
Either don't write index when saving DataFrame to CSV file (df.to_csv('...', index=False)) or if you have to deal with CSV files, which you can't change/edit, use usecols parameter:
A = pd.read_csv('/path/to/fileA.csv', usecols=['userid','locale'])
in order to get rid of the Unnamed:0 column ...
Related
I have a rather messy dataframe in which I need to assign first 3 rows as a multilevel column names.
This is my dataframe and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be the level 0 until next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I need actually is try to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas will read in the top row as the sole header row. You can pass the header argument into pandas.read_excel() that indicates how many rows are to be used as headers. This can be either an int, or list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned you are unable to use pandas.read_excel(). However, if you do already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to specify an array of the header rows which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
return pd.Series(iterable).ffill().to_list()
zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = one = forward_fill(df.iloc[2].to_list())
array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0 and 1) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.fillna(method='ffill').set_index([3, 4, 5]).T
After i created a data frame and make the function get_dummies on my dataframe:
df_final=pd.get_dummies(df,columns=['type'])
I got the new columns that I want and everything is working.
My question is, how can I get the new columns names of the get dummies? my dataframe is dynamic so I can't call is staticly, I wish to save all the new columns names on List.
An option would be:
df_dummy = pd.get_dummies(df, columns=target_cols)
df_dummy.columns.difference(df.columns).tolist()
where df is your original dataframe, df_dummy the output from pd.get_dummies, and target_cols your list of columns to get the dummies.
I have a dataframe df with one column and 500k rows (df with first 5 elements is given below). I want to add new data in the existing column. The new data is a matrix of 200k rows and 1 column. How can I do it? Also I want add a new column named op.
X098_DE_time
0.046104
-0.037134
-0.089496
-0.084906
-0.038594
We can use concat function after rename the column from second dataframe.
df2.rename(columns={'op':' X098_DE_time'}, inplace=True)
new_df = pd.concat([df, new_df], axis=0)
Note: If we don't rename df2 column, the resultant new_df will have 2 different columns.
To add new column you can use
df["new column"] = [list of values];
I have a new Data-frame df. Which was created using:
df= pd.DataFrame()
I have a date value called 'day' which is in format dd-mm-yyyy and a cost value called 'cost'.
How can I append the date and cost values to the df and assign the date as the index?
So for example if I have the following values
day = 01-01-2001
cost = 123.12
the resulting df would look like
date cost
01-01-2001 123.12
I will eventually be adding paired values for multiple days, so the df will eventually look something like:
date cost
01-01-2001 123.12
02-01-2001 23.25
03-01-2001 124.23
: :
01-07-2016 2.214
I have tried to append the paired values to the data frame but am unsure of the syntax. I've tried various thinks including the below but without success.
df.append([day,cost], columns='date,cost',index_col=[0])
There are a few things here. First, making a column the index goes like this, though you can also do it when you load the dataframe from a file (see below):
df.set_index('date', inplace=True)
To add new rows, you should write them out to file first. Pandas isn't great at adding rows dynamically, and this way you can just read the data in when you need it for analysis.
new_row = ... #a row of new data in string format with values
#separated by commas and ending with \n
with open(path, 'a') as f:
f.write(new_row)
You can do this in a loop, or singly, as many time as you need. Then when you're ready to work with it, you use:
df = pd.read_csv(path, index_col=0, parse_dates=True)
index_col can't take a string name for the index column, so you have to use the index of the order on disk; in my case it makes the first column the index. Passing parse_dates=True will make it turn your datetime strings that you declared as the index into datetime objects.
Try this:
dfapp = [day,cost]
df.append(dfapp)
I have a time series excel file with a tri-level column MultiIndex that I would like to successfully parse if possible. There are some results on how to do this for an index on stack overflow but not the columns and the parse function has a header that does not seem to take a list of rows.
The ExcelFile looks like is like the following:
Column A is all the time series dates starting on A4
Column B has top_level1 (B1) mid_level1 (B2) low_level1 (B3) data (B4-B100+)
Column C has null (C1) null (C2) low_level2 (C3) data (C4-C100+)
Column D has null (D1) mid_level2 (D2) low_level1 (D3) data (D4-D100+)
Column E has null (E1) null (E2) low_level2 (E3) data (E4-E100+)
...
So there are two low_level values many mid_level values and a few top_level values but the trick is the top and mid level values are null and are assumed to be the values to the left. So, for instance all the columns above would have top_level1 as the top multi-index value.
My best idea so far is to use transpose, but the it fills Unnamed: # everywhere and doesn't seem to work. In Pandas 0.13 read_csv seems to have a header parameter that can take a list, but this doesn't seem to work with parse.
You can fillna the null values. I don't have your file, but you can test
#Headers as rows for now
df = pd.read_excel(xls_file,0, header=None, index_col=0)
#fill in Null values in "Headers"
df = df.fillna(method='ffill', axis=1)
#create multiindex column names
df.columns=pd.MultiIndex.from_arrays(df[:3].values, names=['top','mid','low'])
#Just name of index
df.index.name='Date'
#remove 3 rows which are already used as column names
df = df[pd.notnull(df.index)]