I need to reshape 3 columns into 2 columns using Python, doubling the number of rows. Input:
col1 col2 col3
A 2 3
B 4 5
Desired output:
col1 col2
A 2
A 3
B 4
B 5
My code:
import csv

hdr = ['col1', 'col2']
final_output = []
for row in rows:
    output = {'col1': row.get('col1'), 'col2': row.get('col2')}
    output1 = {'col1': row.get('col1'), 'col2': row.get('col3')}
    final_output.append(output)
    final_output.append(output1)
with open('tgt_file.csv', 'w', newline='') as tgt_file:
    csv_writer = csv.DictWriter(tgt_file, fieldnames=hdr, delimiter=',')
    csv_writer.writeheader()
    csv_writer.writerows(final_output)
import pandas as pd
### this is the sample data
df = pd.DataFrame(data=[['A', 2, 3], ['B', 4, 5]],
                  columns=['col1', 'col2', 'col3'])
### this is the solution
ef = []  # create an empty list
for i, row in df.iterrows():
    ef.append([row[0], row[1]])  # append col2's value first
    ef.append([row[0], row[2]])  # then col3's value
df = pd.DataFrame(data=ef, columns=['col1', 'col2'])  # recreate the dataframe
Remark: there are more advanced solutions possible, but I think this is readable.
You can try using pd.melt
df = pd.melt(df, id_vars=['col1'], value_name='col2').drop(['variable'], axis=1)
And then you can sort the dataframe on "col1".
Sample data
import pandas as pd
df1 = pd.DataFrame()
df1["Col1"] = [0,2,4,6,2]
df1["Col2"] = [5,1,3,4,0]
df1["Col3"] = [8,0,5,1,7]
df1["Col4"] = [1,4,6,0,8]
#df1_new = df1.iloc[:, 1:3]
df2 = pd.DataFrame()
df2["Col1"] = [8,2,4,6,2,3,5]
df2["Col2"] = [3,7,3,4,0,6,8]
df2["Col3"] = [5,0,5,1,7,9,1]
df2["Col4"] = [0,4,6,0,8,6,0]
#df2_new = df1.iloc[:, 1:3]
dataframes = [df1, df2]
for df in dataframes:
    df_new = df.iloc[:, 1:3]
    print(df_new.sum(axis=0))
The result from the above looks like this:
Col2 13
Col3 21
dtype: int64
Col2 31
Col3 28
dtype: int64
But how can I sum up both dataframes and put the result into a single one?
Result should look like this:
Real example looks like this:
xlsx_files = glob.glob(os.path.join(path, "*.xlsx"))
# loop over the list of xlsx files
for f in xlsx_files:
    # create df from each excel file
    dfs = pd.read_excel(f)
    # grab file name to use it in summarized df
    file_name = f.split("\\")[-1]
    new_df = pd.concat([dfs]).iloc[:, 13:28].sum()
You can either sum the dataframes separately and then add the results, or sum the concatenated dataframes:
df1.iloc[:,1:3].sum() + df2.iloc[:,1:3].sum()
pd.concat([df1,df2]).iloc[:,1:3].sum()
In both cases the result is
Col2 44
Col3 49
dtype: int64
You can convert the result from a series to a DataFrame and transpose using
.to_frame().T
to get this output:
Col2 Col3
0 44 49
For the code in your updated question, you probably want something like this:
xlsx_files = glob.glob(os.path.join(path, "*.xlsx"))
# loop over the list of xlsx files
new_df = pd.DataFrame()
for f in xlsx_files:
    # create df from each excel file
    dfs = pd.read_excel(f)
    # grab file name to use it in summarized df
    file_name = f.split("\\")[-1]
    new_df = pd.concat([new_df, dfs])
result = new_df.iloc[:, 13:28].sum()
Here is another way: combine the sums of the individual DFs, convert the result to a DataFrame, transpose, and then choose Col2 and Col3:
(df1.sum() + df2.sum()).to_frame().T[['Col2','Col3']]
Col2 Col3
0 44 49
Get the columnwise sums of both dataframes, take the middle two columns of each, and add them together. Then, transpose the result to turn the rows into columns:
pd.DataFrame((df1.iloc[:, 1:3].sum() + df2.iloc[:, 1:3].sum())).T
This outputs:
Col2 Col3
0 44 49
Here is one way:
long, short = (df1, df2) if len(df1.index) > len(df2.index) else (df2, df1)
print((short[["Col2", "Col3"]].reindex(long.index, fill_value=0) + long[["Col2", "Col3"]]).sum().to_frame().T)
Or, if you need to use iloc for the columns, here is another way:
long, short = (df1, df2) if len(df1.index) > len(df2.index) else (df2, df1)
print((short.iloc[:, 1:3].reindex(long.index, fill_value=0) + long.iloc[:, 1:3]).sum().to_frame().T)
Output (same for both):
Col2 Col3
0 44 49
I have the following DataFrame:
I need to switch values of col2 and col3 with the values of col4 and col5. Values of col1 will remain the same. The end result needs to look as the following:
Is there a way to do this without looping through the DataFrame?
Use rename in pandas
In [160]: df = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]})
In [161]: df
Out[161]:
A B
0 1 3
1 2 4
2 3 5
In [167]: df.rename({'B':'A','A':'B'},axis=1)
Out[167]:
B A
0 1 3
1 2 4
2 3 5
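Note that rename only swaps the labels; the data stays in its original positions, which is why the output above shows the columns in the order B A. If you also want the columns to appear in the original order, one option (a small sketch using the same toy frame) is to reselect after renaming:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

# swap the labels, then reselect so the columns appear in the original A, B order
swapped = df.rename({'B': 'A', 'A': 'B'}, axis=1)[['A', 'B']]
print(swapped)
```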
This should do:
og_cols = df.columns
new_cols = [df.columns[0], *df.columns[3:], *df.columns[1:3]]
df = df[new_cols] # Sort columns in the desired order
df.columns = og_cols # Use original column names
If you want to swap the column values:
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:,3:].to_numpy(copy=True), df.iloc[:,1:3].to_numpy(copy=True)
Pandas reindex could help:
cols = df.columns
#reposition the columns
df = df.reindex(columns=['col1','col4','col5','col2','col3'])
#restore the original names
df.columns = cols
Note: See EDIT below.
I need to keep a log of all rows dropped from my df, but I'm not sure how to capture them. The log should be a data frame that I can update for each .drop or .drop_duplicates operation. Here are 3 examples of the code for which I want to log dropped rows:
df_jobs_by_user = df.drop_duplicates(subset=['owner', 'job_number'], keep='first')
df.drop(df.index[indexes], inplace=True)
df = df.drop(df[df.submission_time.dt.strftime('%Y') != '2018'].index)
I found this solution to a different .drop case that uses pd.isnull to recode a pd.dropna statement and so allows a log to be generated prior to actually dropping the rows:
df.dropna(subset=['col2', 'col3']).equals(df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)])
But in trying to adapt it to pd.drop_duplicates, I find there is no pd.isduplicate parallel to pd.isnull, so this may not be the best way to achieve the results I need.
EDIT
I rewrote my question here to be more precise about the result I want.
I start with a df that has one dupe row:
import pandas as pd
import numpy as np
df = pd.DataFrame([['whatever', 'dupe row', 'x'], ['idx 1', 'uniq row', np.nan], ['sth diff', 'dupe row', 'x']], columns=['col1', 'col2', 'col3'])
print(df)
# Output:
col1 col2 col3
0 whatever dupe row x
1 idx 1 uniq row NaN
2 sth diff dupe row x
I then implement the solution from jjp:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df.append(df.loc[mask])
I print the results:
print(df_keep)
# Output:
col1 col2 col3
0 whatever dupe row x
1 idx 1 uniq row NaN
df_keep is what I expect and want.
print(df_droplog)
# Output:
col1 col2 col3
0 whatever dupe row x
1 idx 1 uniq row NaN
2 sth diff dupe row x
2 sth diff dupe row x
df_droplog is not what I want. It includes the rows from index 0 and index 1 which were not dropped and which I therefore do not want in my drop log. It also includes the row from index 2 twice. I want it only once.
What I want:
print(df_droplog)
# Output:
col1 col2 col3
2 sth diff dupe row x
There is a parallel: pd.DataFrame.duplicated returns a Boolean series. You can use it as follows:
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['owner', 'job_number'], keep='first')
df_jobs_by_user = df.loc[~mask]
df_droplog = df_droplog.append(df.loc[mask])
Since you only want the duplicated rows in df_droplog, just append only those to an empty dataframe. What you were doing was appending them to the original dataframe df. Try this,
df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
df_droplog = df_droplog.append(df.loc[mask])
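One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same log can be built with pd.concat (a sketch using the sample frame from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['whatever', 'dupe row', 'x'],
                   ['idx 1', 'uniq row', np.nan],
                   ['sth diff', 'dupe row', 'x']],
                  columns=['col1', 'col2', 'col3'])

df_droplog = pd.DataFrame()
mask = df.duplicated(subset=['col2', 'col3'], keep='first')
df_keep = df.loc[~mask]
# pd.concat replaces the removed DataFrame.append
df_droplog = pd.concat([df_droplog, df.loc[mask]])
print(df_droplog)
```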
I am reading a csv file with pandas. I need to duplicate rows according to the number of strings in a given column (there could be multiple). Example, using col1 and separator "|":
in_csv:
col1, col2, col3
ABC|EFG, 1, a
ABC|EFG, 1, bb
ABC|EFG, 2, c
out_csv:
col1, col2, col3
ABC, 1, a
EFG, 1, a
ABC, 1, bb
EFG, 1, bb
ABC, 2, c
EFG, 2, c
I tried reading through a loop row by row, using incsv_dt.row1.iloc[ii].split('|'), but I believe there should be an easier way to do it. The strings in col1 separated by | could be multiple.
Thanks
Unsorted, and might not work if there are entries without the '|' in the first column. Creates two dataframes based on 'col1' and then appends them together. Also might not work if there are multiple '|'s in col1.
df = pd.DataFrame()
df['col1'] = ['APC|EFG', 'APC|EFG','APC|EFG']
df['col2'] = [1,1,2]
df['col3'] = ['a','bb','c']
# split into two columns based on '|' delimiter
df = pd.concat([df, df['col1'].str.split('|', expand = True)], axis=1)
# create two dataframes with new labels
df2 = df.drop(['col1',1], axis=1)
df2.rename(columns={0: 'col1'}, inplace=True)
df3 = df.drop(['col1',0], axis=1)
df3.rename(columns={1: 'col1'}, inplace=True)
# append them together
df = df2.append(df3)
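For comparison, a more general sketch that handles any number of '|' separators (and rows with none) uses str.split followed by DataFrame.explode, available since pandas 0.25:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['ABC|EFG', 'ABC|EFG', 'ABC|EFG'],
                   'col2': [1, 1, 2],
                   'col3': ['a', 'bb', 'c']})

# split col1 into lists, then emit one row per list element
out = (df.assign(col1=df['col1'].str.split('|'))
         .explode('col1')
         .reset_index(drop=True))
print(out)
```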
Setup for the example:
df = pd.DataFrame()
df['col1'] = ['APC|EFG', 'APC', 'APC|EFG|XXX']
df['col2'] = [1, 1, 2]
df['col3'] = ['a', 'bb', 'c']
You can first create a new data frame with the split columns, then drop the empty values. This works fine if some values have multiple splits and some have none.
dfs = (df['col1'].str.split('|', expand=True)
       .unstack().reset_index().set_index('level_1')[0]
       .dropna().to_frame())
To merge this with the original dataframe, make sure the indexes are the same. When I tried, the original dataframe had a RangeIndex, so I convert that to an integer index:
df.index = list(df.index)
Then you can merge the data frames on the index and rename the new column back to 'col1'
df_result = pd.merge(dfs,
                     df[['col2', 'col3']],
                     left_index=True, right_index=True,
                     how='outer').rename(columns={0: 'col1'})
print(df_result)
print(df_result)
Results in
col1 col2 col3
0 APC 1 a
0 EFG 1 a
1 APC 1 bb
2 APC 2 c
2 EFG 2 c
2 XXX 2 c
Assuming that I have a dataframe with the following values:
df:
col1 col2 value
1 2 3
1 2 1
2 3 1
I want to first group my dataframe based on the first two columns (col1 and col2) and then average over the values of the third column (value). So the desired output would look like this:
col1 col2 avg-value
1 2 2
2 3 1
I am using the following code:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby('col1','col2').mean())
which gets the following error:
ValueError: No axis named col2 for object type <class 'pandas.core.frame.DataFrame'>
Any help would be much appreciated.
You need to pass a list of the columns to groupby, what you passed was interpreted as the axis param which is why it raised an error:
In [30]:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby(['col1','col2']).mean())
avg
col1 col2
1 2 3
3 3
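The grouped result above carries col1 and col2 in a MultiIndex; if you would rather keep them as regular columns, pass as_index=False (or call .reset_index() on the result). A small sketch with the data from the question:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 2, 1], [2, 3, 1]],
                  columns=['col1', 'col2', 'value'])

# as_index=False keeps the grouping keys as ordinary columns
result = df.groupby(['col1', 'col2'], as_index=False).mean()
print(result)
```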
If you want to group by multiple columns, you should put them in a list:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).mean())
Or slightly more verbose, for the sake of getting the word 'avg' in your aggregated dataframe:
import numpy as np
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).agg({'value': {'avg': np.mean}}))
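Note that the nested-dict form of agg shown above was deprecated in pandas 0.20 and later removed; on current versions the same 'avg' label can be produced with named aggregation (pandas >= 0.25), sketched here:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 3, 3], [2, 3, 1]],
                  columns=['col1', 'col2', 'value'])

# named aggregation labels the result column 'avg' directly
result = df.groupby(['col1', 'col2']).agg(avg=('value', 'mean'))
print(result)
```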