Add new column in Pandas DataFrame Python [duplicate]

This question already has answers here:
How to add a new column to an existing DataFrame?
(32 answers)
Closed 4 years ago.
I have dataframe in Pandas for example:
Col1 Col2
A 1
B 2
C 3
Now I would like to add one more column, named Col3, whose value is based on Col2: if Col2 > 1, then Col3 is 0; otherwise it is 1. So for the example above, the output would be:
Col1 Col2 Col3
A 1 1
B 2 0
C 3 0
Any idea on how to achieve this?

You can just do the opposite comparison: Col2 <= 1. This returns a boolean Series with False for values greater than 1 and True for the rest. If you convert it to an int dtype, True becomes 1 and False becomes 0:
df['Col3'] = (df['Col2'] <= 1).astype(int)
If you want a more general solution, where you can assign any number to Col3 depending on the value of Col2, you can do something like:
df['Col3'] = df['Col2'].map(lambda x: 42 if x > 1 else 55)
Or:
df['Col3'] = 0
condition = df['Col2'] > 1
df.loc[condition, 'Col3'] = 42
df.loc[~condition, 'Col3'] = 55
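As an aside (not from the original answer), the same two-way choice is often written with numpy.where, which vectorizes the if/else over the whole column:
import numpy as np
# np.where(condition, value_if_true, value_if_false)
df['Col3'] = np.where(df['Col2'] > 1, 0, 1)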

The easiest way that I found for adding a column to a DataFrame was to use the add method. Here's a snippet of code, which also writes the output to a CSV file. Note that the columns argument lets you set the name of the column (which here happens to be the same as the name of the np.array used as the source of the data).
# create a pandas DataFrame
df = pd.DataFrame(data=FF_maxRSSBasal, columns=['FF_maxRSSBasal'])
# from here on, we use the trick of creating a new dataframe and then "add"ing it
df2 = pd.DataFrame(data=FF_maxRSSPrism, columns=['FF_maxRSSPrism'])
df = df.add(df2, fill_value=0)
df2 = pd.DataFrame(data=FF_maxRSSPyramidal, columns=['FF_maxRSSPyramidal'])
df = df.add(df2, fill_value=0)
df2 = pd.DataFrame(data=deltaFF_strainE22, columns=['deltaFF_strainE22'])
df = df.add(df2, fill_value=0)
df2 = pd.DataFrame(data=scaled, columns=['scaled'])
df = df.add(df2, fill_value=0)
df2 = pd.DataFrame(data=deltaFF_orientation, columns=['deltaFF_orientation'])
df = df.add(df2, fill_value=0)
#print(df)
df.to_csv('FF_data_frame.csv')
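For reference, here is a minimal self-contained sketch of the same add trick with made-up arrays (the FF_* variables above are the answerer's own data and are not defined here):
import numpy as np
import pandas as pd
a = np.array([1.0, 2.0, 3.0])  # hypothetical stand-in for FF_maxRSSBasal
b = np.array([4.0, 5.0, 6.0])  # hypothetical stand-in for FF_maxRSSPrism
df = pd.DataFrame(data=a, columns=['a'])
df2 = pd.DataFrame(data=b, columns=['b'])
# add() aligns on index and columns; fill_value=0 keeps the non-overlapping
# columns as their own values instead of turning them into NaN
df = df.add(df2, fill_value=0)
print(df)
#      a    b
# 0  1.0  4.0
# 1  2.0  5.0
# 2  3.0  6.0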

Related

how to sum up columns from different dataframes into a single dataframe in pandas

Sample data
import pandas as pd
df1 = pd.DataFrame()
df1["Col1"] = [0,2,4,6,2]
df1["Col2"] = [5,1,3,4,0]
df1["Col3"] = [8,0,5,1,7]
df1["Col4"] = [1,4,6,0,8]
#df1_new = df1.iloc[:, 1:3]
df2 = pd.DataFrame()
df2["Col1"] = [8,2,4,6,2,3,5]
df2["Col2"] = [3,7,3,4,0,6,8]
df2["Col3"] = [5,0,5,1,7,9,1]
df2["Col4"] = [0,4,6,0,8,6,0]
#df2_new = df1.iloc[:, 1:3]
dataframes = [df1, df2]
for df in dataframes:
    df_new = df.iloc[:, 1:3]
    print(df_new.sum(axis=0))
result from above looks like this:
Col2 13
Col3 21
dtype: int64
Col2 31
Col3 28
dtype: int64
But how can I sum up both dataframes and put it into a single one?
Result should look like this:
Col2 Col3
0 44 49
My real example looks like this:
xlsx_files = glob.glob(os.path.join(path, "*.xlsx"))
#print(xlsx_files)
# loop over the list of xlsx files
for f in xlsx_files:
    # create df from each excel file
    dfs = pd.read_excel(f)
    # grab the file name to use it in the summarized df
    file_name = f.split("\\")[-1]
    new_df = pd.concat([dfs]).iloc[:, 13:28].sum()
You can either sum the dataframes separately and then add the results, or sum the concatenated dataframes:
df1.iloc[:,1:3].sum() + df2.iloc[:,1:3].sum()
pd.concat([df1,df2]).iloc[:,1:3].sum()
In both cases the result is
Col2 44
Col3 49
dtype: int64
You can convert the result from a series to a DataFrame and transpose using
.to_frame().T
to get this output:
Col2 Col3
0 44 49
For the code in your updated question, you probably want something like this:
xlsx_files = glob.glob(os.path.join(path, "*.xlsx"))
new_df = pd.DataFrame()
# loop over the list of xlsx files
for f in xlsx_files:
    # create df from each excel file
    dfs = pd.read_excel(f)
    # grab the file name to use it in the summarized df
    file_name = f.split("\\")[-1]
    new_df = pd.concat([new_df, dfs])
result = new_df.iloc[:, 13:28].sum()
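A small aside on general pandas practice (not part of the original answer): concatenating inside the loop re-copies the accumulated frame on every iteration, so with many files it is usually better to collect the frames in a list and concatenate once:
frames = []
for f in xlsx_files:
    frames.append(pd.read_excel(f))
result = pd.concat(frames).iloc[:, 13:28].sum()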
Here is another way to go about it: add the individual sums of the two DFs, convert the result to a DataFrame, and then select Col2 and Col3 after transposing:
(df1.sum() + df2.sum()).to_frame().T[['Col2','Col3']]
Col2 Col3
0 44 49
Get the columnwise sums of both dataframes, take the middle two columns of each, and add them together. Then, transpose the result to turn the rows into columns:
pd.DataFrame((df1.iloc[:, 1:3].sum() + df2.iloc[:, 1:3].sum())).T
This outputs:
Col2 Col3
0 44 49
Here is one way:
long, short = (df1, df2) if len(df1.index) > len(df2.index) else (df2, df1)
print((short[["Col2", "Col3"]].reindex(long.index, fill_value=0) + long[["Col2", "Col3"]]).sum().to_frame().T)
Or, if you need to use iloc for the columns, here is another way:
long, short = (df1, df2) if len(df1.index) > len(df2.index) else (df2, df1)
print((short.iloc[:, 1:3].reindex(long.index, fill_value=0) + long.iloc[:, 1:3]).sum().to_frame().T)
Output (same for both):
Col2 Col3
0 44 49

How to switch column values in the same Pandas DataFrame

I have the following DataFrame:
I need to switch the values of col2 and col3 with the values of col4 and col5. The values of col1 will remain the same. The end result needs to look like the following:
Is there a way to do this without looping through the DataFrame?
Use rename in pandas
In [160]: df = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]})
In [161]: df
Out[161]:
A B
0 1 3
1 2 4
2 3 5
In [167]: df.rename({'B':'A','A':'B'},axis=1)
Out[167]:
B A
0 1 3
1 2 4
2 3 5
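Note that rename only swaps the labels, not the underlying data, so the values stay in place. To show the values actually switched under the original header order, select the columns back in that order afterwards (a small sketch building on the frame above):
df2 = df.rename({'B': 'A', 'A': 'B'}, axis=1)
# ['A', 'B'] now pulls the old B data under A and the old A data under B
df2 = df2[['A', 'B']]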
This should do:
og_cols = df.columns
new_cols = [df.columns[0], *df.columns[3:], *df.columns[1:3]]
df = df[new_cols] # Sort columns in the desired order
df.columns = og_cols # Use original column names
If you want to swap the column values:
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:,3:].to_numpy(copy=True), df.iloc[:,1:3].to_numpy(copy=True)
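A quick self-contained demonstration of that swap on a hypothetical five-column frame matching the question's layout (the data is made up):
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [10, 20], 'col3': [30, 40],
                   'col4': [50, 60], 'col5': [70, 80]})
# copy=True guards against the arrays being views into df's buffer,
# which the first assignment would otherwise mutate
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:, 3:].to_numpy(copy=True), df.iloc[:, 1:3].to_numpy(copy=True)
print(df)
#    col1  col2  col3  col4  col5
# 0     1    50    70    10    30
# 1     2    60    80    20    40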
Pandas reindex could help:
cols = df.columns
#reposition the columns
df = df.reindex(columns=['col1','col4','col5','col2','col3'])
#pass in new names
df.columns = cols

pandas How to store rows dropped using `drop_duplicates`?

Note: See EDIT below.
I need to keep a log of all rows dropped from my df, but I'm not sure how to capture them. The log should be a data frame that I can update for each .drop or .drop_duplicates operation. Here are 3 examples of the code for which I want to log dropped rows:
df_jobs_by_user = df.drop_duplicates(subset=['owner', 'job_number'], keep='first')
df.drop(df.index[indexes], inplace=True)
df = df.drop(df[df.submission_time.dt.strftime('%Y') != '2018'].index)
I found this solution to a different .drop case that uses pd.isnull to recode a pd.dropna statement and so allows a log to be generated prior to actually dropping the rows:
df.dropna(subset=['col2', 'col3']).equals(df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)])
But in trying to adapt it to pd.drop_duplicates, I find there is no pd.isduplicate parallel to pd.isnull, so this may not be the best way to achieve the results I need.
EDIT
I rewrote my question here to be more precise about the result I want.
I start with a df that has one dupe row:
import pandas as pd
import numpy as np
df = pd.DataFrame([['whatever', 'dupe row', 'x'], ['idx 1', 'uniq row', np.nan], ['sth diff', 'dupe row', 'x']], columns=['col1', 'col2', 'col3'])
print(df)
# Output:
col1 col2 col3
0 whatever dupe row x
1 idx 1 uniq row NaN
2 sth diff dupe row x
I then implement the solution from jjp:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df.append(df.loc[mask])
I print the results:
print(df_keep)
# Output:
col1 col2 col3
0 whatever dupe row x
1 idx 1 uniq row NaN
df_keep is what I expect and want.
print(df_droplog)
# Output:
col1 col2 col3
0 whatever dupe row x
1 idx 1 uniq row NaN
2 sth diff dupe row x
2 sth diff dupe row x
df_droplog is not what I want. It includes the rows from index 0 and index 1 which were not dropped and which I therefore do not want in my drop log. It also includes the row from index 2 twice. I want it only once.
What I want:
print(df_droplog)
# Output:
col1 col2 col3
2 sth diff dupe row x
There is a parallel: pd.DataFrame.duplicated returns a Boolean series. You can use it as follows:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['owner', 'job_number'], keep='first')
df_jobs_by_user = df.loc[~mask]
df_droplog = df_droplog.append(df.loc[mask])
Since you only want the duplicated rows in df_droplog, append only those to an empty dataframe. What you were doing was appending them to the original dataframe df. Try this:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df_droplog.append(df.loc[mask])
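One caveat for newer environments (not in the original answers): DataFrame.append was removed in pandas 2.0, so on a current install the same log can be built without it, for example:
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df.loc[mask]  # exactly the dropped rows
# or, to accumulate across several drop operations:
# df_droplog = pd.concat([df_droplog, df.loc[mask]])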

Split CSV file on multiple delimiters and then detect duplicate rows

I am reading a CSV file with pandas. I need to duplicate rows according to the number of strings in a given column (there could be several). Example, using col1 and separator "|":
in_csv:
col1, col2, col3
ABC|EFG, 1, a
ABC|EFG, 1, bb
ABC|EFG, 2, c
out_csv:
col1, col2, col3
ABC, 1, a
EFG, 1, a
ABC, 1, bb
EFG, 1, bb
ABC, 2, c
EFG, 2, c
I tried reading through it in a loop row by row, using incsv_dt.row1.iloc[ii].split('|'), but I believe there should be an easier way to do it. Note that col1 can contain any number of strings separated by |.
Thanks
Unsorted, and it might not work if there are entries without the '|' in the first column. This creates two dataframes based on 'col1' and then appends them together. It also might not work if there are multiple '|'s in col1.
df = pd.DataFrame()
df['col1'] = ['APC|EFG', 'APC|EFG','APC|EFG']
df['col2'] = [1,1,2]
df['col3'] = ['a','bb','c']
# split into two columns based on '|' delimiter
df = pd.concat([df, df['col1'].str.split('|', expand=True)], axis=1)
# create two dataframes with new labels
df2 = df.drop(['col1',1], axis=1)
df2.rename(columns={0: 'col1'}, inplace=True)
df3 = df.drop(['col1',0], axis=1)
df3.rename(columns={1: 'col1'}, inplace=True)
# append them together
df = df2.append(df3)
Setup for the example:
df = pd.DataFrame()
df['col1'] = ['APC|EFG', 'APC', 'APC|EFG|XXX']
df['col2'] = [1, 1, 2]
df['col3'] = ['a', 'bb', 'c']
You can first create a new data frame with the split columns, then drop the empty values. This works fine if some values have multiple splits and some have none.
dfs = (df['col1'].str.split('|', expand=True)
       .unstack()
       .reset_index()
       .set_index('level_1')[0]
       .dropna()
       .to_frame())
To merge this with the original dataframe, make sure the indexes are the same. When I tried it, the original dataframe had a RangeIndex, so I converted that to integers:
df.index = list(df.index)
Then you can merge the data frames on the index and rename the new column back to 'col1':
df_result = pd.merge(dfs, df[['col2', 'col3']],
                     left_index=True, right_index=True,
                     how='outer').rename(columns={0: 'col1'})
print(df_result)
Results in
col1 col2 col3
0 APC 1 a
0 EFG 1 a
1 APC 1 bb
2 APC 2 c
2 EFG 2 c
2 XXX 2 c
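For the fully general case with any number of '|' separators per cell, recent pandas (0.25+) offers a more compact route via str.split plus explode; a sketch using the same setup as above:
out = df.assign(col1=df['col1'].str.split('|')).explode('col1')
print(out)
#   col1  col2 col3
# 0  APC     1    a
# 0  EFG     1    a
# 1  APC     1   bb
# 2  APC     2    c
# 2  EFG     2    c
# 2  XXX     2    c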

Pandas dataframe: Group by two columns and then average over another column

Assuming that I have a dataframe with the following values:
df:
col1 col2 value
1 2 3
1 2 1
2 3 1
I want to first group my dataframe based on the first two columns (col1 and col2) and then average over the values of the third column (value). So the desired output would look like this:
col1 col2 avg-value
1 2 2
2 3 1
I am using the following code:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby('col1','col2').mean())
which gets the following error:
ValueError: No axis named col2 for object type <class 'pandas.core.frame.DataFrame'>
Any help would be much appreciated.
You need to pass a list of the columns to groupby; what you passed was interpreted as the axis param, which is why it raised an error:
In [30]:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby(['col1','col2']).mean())
           avg
col1 col2
1    2       3
     3       3
If you want to group by multiple columns, you should put them in a list:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).mean())
Or slightly more verbose, for the sake of getting the word 'avg' in your aggregated dataframe:
import numpy as np
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).agg({'value': {'avg': np.mean}}))
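A caveat on that last snippet: the nested-dict form {'value': {'avg': np.mean}} was deprecated in pandas 0.20 and later removed, so on a current install the equivalent is named aggregation (a sketch with the question's data):
import pandas as pd
df = pd.DataFrame({'col1': [1, 1, 2], 'col2': [2, 2, 3], 'value': [3, 1, 1]})
print(df.groupby(['col1', 'col2']).agg(avg=('value', 'mean')))
#            avg
# col1 col2
# 1    2     2.0
# 2    3     1.0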
