I have been trying to find out how to append a Total row that sums the columns.
There is an elegant solution to this problem here: [SOLVED] Pandas dataframe total row
However, when using this method, I have noticed a warning message:
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
I tried this alternative to avoid relying on deprecated code, but when I used the concat method it appended the sums vertically, as a new column, instead of as a single total row.
Code I used:
pd.concat([df, df.sum(numeric_only=True)])
Result:
a b c d e
0 2200.0 14.30 NaN 2185.70 NaN
1 3200.0 20.80 NaN 3179.20 NaN
2 6400.0 41.60 NaN 6358.40 NaN
3 NaN NaN NaN NaN 11800.00 <-- Appended using Concat
4 NaN NaN NaN NaN 76.70 <-- Appended using Concat
5 NaN NaN NaN NaN 0.00 <-- Appended using Concat
6 NaN NaN NaN NaN 11723.30 <-- Appended using Concat
What I want:
a b c d
0 2200.0 14.30 NaN 2185.70
1 3200.0 20.80 NaN 3179.20
2 6400.0 41.60 NaN 6358.40
3 11800.00 76.70 0.00 11723.30 <-- Appended using Concat
Is there an elegant solution to this problem using concat method?
Convert the sum (which is a pandas.Series) to a DataFrame and transpose before concat:
>>> pd.concat([df, df.sum().to_frame().T], ignore_index=True)
a b c d
0 2200.0 14.3 NaN 2185.7
1 3200.0 20.8 NaN 3179.2
2 6400.0 41.6 NaN 6358.4
3 11800.0 76.7 0.0 11723.3
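If you would rather see a label on the appended row than a new numeric index, a small variation (the Total label is just an illustrative choice) is to name the Series before transposing and drop ignore_index, which yields the same sums with Total as the index label:
>>> pd.concat([df, df.sum(numeric_only=True).rename('Total').to_frame().T])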
I have 2 dataframes that contain 3 account indicators per account number. The account numbers are like for like in the column labelled "account". I am trying to modify dataframe 2 so it matches dataframe 1 in terms of having the same NaN values in each column.
Dataframe 1:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1234567890, 1, np.nan, 'G'],
                   [7854567890, np.nan, 100, np.nan],
                   [7854567899, np.nan, np.nan, np.nan],
                   [7854567893, np.nan, 100, np.nan],
                   [7854567893, np.nan, np.nan, np.nan],
                   [9632587415, np.nan, np.nan, 'B']],
                  columns=['account', 'ind_1', 'ind_2', 'ind_3'])
df
Output:
account ind_1 ind_2 ind_3
0 1234567890 1.0 NaN G
1 7854567890 NaN 100.0 NaN
2 7854567899 NaN NaN NaN
3 7854567893 NaN 100.0 NaN
4 7854567893 NaN NaN NaN
5 9632587415 NaN NaN B
Dataframe 2:
df2 = pd.DataFrame([[1234567890, 5, np.nan, 'GG'],
                    [7854567890, 1, 106, np.nan],
                    [7854567899, np.nan, 100, 'N'],
                    [7854567893, np.nan, 100, np.nan],
                    [7854567893, np.nan, np.nan, np.nan],
                    [9632587415, 3, np.nan, 'B']],
                   columns=['account', 'ind_1', 'ind_2', 'ind_3'])
df2
Output:
account ind_1 ind_2 ind_3
0 1234567890 5.0 NaN GG
1 7854567890 1.0 106.0 NaN
2 7854567899 NaN 100.0 N
3 7854567893 NaN 100.0 NaN
4 7854567893 NaN NaN NaN
5 9632587415 3.0 NaN B
Problem:
I need to change dataframe 2 so that it has NaN values in the same positions as dataframe 1.
For example: column ind_1 has values at index 0, 1 and 5 in df2, whereas in df1 it only has a value at index 0. I need to replace the values at index 1 and 5 in df2 with NaN to match the NaN positions in df1. The same logic applies to the other 2 columns.
Expected outcome:
account ind_1 ind_2 ind_3
0 1234567890 5.0 NaN GG
1 7854567890 NaN 106.0 NaN
2 7854567899 NaN NaN NaN
3 7854567893 NaN 100.0 NaN
4 7854567893 NaN NaN NaN
5 9632587415 NaN NaN B
Is there any easy way to achieve this?
Thanks in advance
Alan
Try this:
df2[~df.isna()]
df.isna() checks where df has NaN values and creates a Boolean mask;
~ inverts it, so slicing df2 with that mask keeps df2's values only where df is not NaN.
df.isna() returns a DataFrame with True wherever there are NaN values in your first dataframe. You can use this as a Boolean mask to set the corresponding cells to NaN in your second DataFrame:
df2[df.isna()] = None
Be careful: I assume you also need these NaNs to be associated with rows that have the same value for account. This solution does not ensure that; it assumes the values in the account column appear in exactly the same order in both dataframes.
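If the row order is not guaranteed, here is a rough sketch of one way to align the two frames by account before masking (this pairs up duplicate accounts, such as 7854567893, in the order they appear, which is an assumption on my part):
# Sketch: index both frames by (account, occurrence number) so the
# Boolean mask aligns by account rather than by row position.
key = lambda d: [d['account'], d.groupby('account').cumcount()]
mask = df.set_index(key(df)).isna()
out = df2.set_index(key(df2))
out[mask] = np.nan                                # aligned on the shared index
out = out.reset_index(level=1, drop=True).reset_index()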
I have edited this post with the specific case:
I have a list of dataframes like this (note that df1 and df2 have a row in common):
df1
index  Date        A
0      2010-06-19  4
1      2010-06-20  3
2      2010-06-21  2
3      2010-06-22  1
4      2012-07-19  5
df2
index  Date        B
0      2012-07-19  5
1      2012-07-20  6
df3
index  Date        C
0      2020-06-19  5
1      2020-06-20  2
2      2020-06-21  9
df_list = [df1, df2, df3]
I would like to merge all the dataframes into a single dataframe, without losing rows, placing NaN where there is nothing to merge. The criterion is to merge them on the 'Date' column (which should contain all the dates from all the merged dataframes, ordered by date).
The resulting dataframe should look like this:
index  Date        A    B    C
0      2010-06-19  4    nan  nan
1      2010-06-20  3    nan  nan
2      2010-06-21  2    nan  nan
3      2010-06-22  1    nan  nan
4      2012-07-19  5    5    nan
5      2012-07-20  nan  6    nan
6      2020-06-19  nan  nan  5
7      2020-06-20  nan  nan  2
8      2020-06-21  nan  nan  9
I tried something like this:
from functools import reduce
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Date'], how='outer'), df_list)
BUT the resulting dataframe is not as expected (some columns are missing and it is not ordered by date). I think I am missing something.
Thank you very much
Use pandas.concat(). It takes a list of dataframes, and appends common columns together, filling new columns with NaN as necessary:
new_df = pd.concat([df1, df2, df3])
Output:
>>> new_df
index Date A B C
0 0 2010-06-19 4.0 NaN NaN
1 1 2010-06-20 3.0 NaN NaN
2 2 2010-06-21 2.0 NaN NaN
3 3 2010-06-22 1.0 NaN NaN
0 0 2012-07-19 NaN 5.0 NaN
1 1 2012-07-20 NaN 6.0 NaN
0 0 2020-06-19 NaN NaN 5.0
1 1 2020-06-20 NaN NaN 2.0
2 2 2020-06-21 NaN NaN 9.0
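Note that concat alone keeps the two 2012-07-19 rows separate. If you also want a single row per date, as in the expected output, one possible follow-up (a sketch, assuming the old index column can be dropped) is to group the result by Date and take the first non-null value per column; groupby also sorts the dates for you:
new_df = (pd.concat([df1, df2, df3])
            .drop(columns='index')   # drop the per-frame index column
            .groupby('Date', as_index=False)
            .first())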
For overlapping data, I had to add sort=True to the merge call inside the lambda. Without it, the order was wrong for big dataframes and I only saw the NaN rows at the start and end of the frame. Thank you all ;-)
from functools import reduce
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Date'],
                                                how='outer', sort=True), df_list)
I have a Pandas DataFrame like the following:
timestamp A B C D E F
0 1607594400000 83.69 NaN NaN NaN 1003.20 8.66
1 1607594400000 NaN 2.57 44.35 17.18 NaN NaN
2 1607595000000 83.07 NaN NaN NaN 1003.32 8.68
3 1607595000000 NaN 3.00 42.31 20.08 NaN NaN
.. ... ... ... ... ... ... ...
325 1607691600000 90.19 NaN NaN NaN 997.32 10.22
326 1607691600000 NaN 1.80 30.10 14.85 NaN NaN
328 1607692200000 NaN 1.60 26.06 12.78 NaN NaN
327 1607692200000 91.33 NaN NaN NaN 997.52 10.21
I need to combine the rows that have the same timestamp value: where one row has NaN and the other has a value, the value is kept; where both rows have values, the average of the two is taken.
I tried the solution from the following question, but it is not exactly my situation and I don't know how to adapt it:
pandas, combine rows based on certain column values and NAN
Just use groupby:
df.groupby('timestamp', as_index=False).mean()
Try with first; it will pick the first non-null value in each group:
out = df.groupby('timestamp', as_index=False).first()
Or, since mean(level=0) is deprecated in recent pandas, group on the index instead:
out = df.set_index('timestamp').groupby(level=0).mean()
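The difference between the two matters only when both rows have a value in the same column (a small made-up example, not from the data above): first keeps the earlier value, while mean averages the two, which is what the question asks for:
>>> d = pd.DataFrame({'timestamp': [1, 1], 'A': [2.0, 4.0], 'B': [np.nan, 7.0]})
>>> d.groupby('timestamp', as_index=False).mean()
   timestamp    A    B
0          1  3.0  7.0
>>> d.groupby('timestamp', as_index=False).first()
   timestamp    A    B
0          1  2.0  7.0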
I'm translating some data-cleaning steps previously done in SPSS Modeler to Python. In SPSS you have a 'node' called Restructure. I'm trying to figure out how to do the same operation in Python, but I'm struggling with how to achieve it. What it does is combine every value in column X with all values in the other columns A, B, C, etc.
So, original dataframe looks like this:
Code Freq1 Freq2
A01 1 7
B02 0 6
C03 17 8
And after transforming it it should look like this:
Code Freq1 Freq2 A01_Freq1 A01_Freq2 B02_Freq1 B02_Freq2 C03_Freq1 C03_Freq2
A01 1 7 1 7 Nan Nan Nan Nan
B02 0 6 Nan Nan 0 6 Nan Nan
C03 17 8 Nan Nan Nan Nan 17 8
I've tried some pivoting stuff, but I guess this cannot be done in one step in Python...
Use DataFrame.set_index with DataFrame.unstack and DataFrame.sort_index to build a new DataFrame with a MultiIndex, then flatten the column names with f-strings, and finally add it to the original with DataFrame.join:
df1 = df.set_index('Code', append=True).unstack().sort_index(axis=1, level=1)
df1.columns = df1.columns.map(lambda x: f'{x[1]}_{x[0]}')
df = df.join(df1)
print(df)
Code Freq1 Freq2 A01_Freq1 A01_Freq2 B02_Freq1 B02_Freq2 C03_Freq1 \
0 A01 1 7 1.0 7.0 NaN NaN NaN
1 B02 0 6 NaN NaN 0.0 6.0 NaN
2 C03 17 8 NaN NaN NaN NaN 17.0
C03_Freq2
0 NaN
1 NaN
2 8.0
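An alternative sketch using DataFrame.pivot (this assumes each Code value appears only once, as in the sample; the row helper column is just a temporary name):
# Pivot Code into the columns, keyed by a temporary row-number column.
wide = (df.assign(row=df.index)
          .pivot(index='row', columns='Code', values=['Freq1', 'Freq2']))
wide.columns = [f'{code}_{freq}' for freq, code in wide.columns]
df_out = df.join(wide.sort_index(axis=1))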
I'm analyzing Excel files generated by an organization that publishes yearly reports as Excel files. Each year, the column names (Year, A1, B1, C1, etc.) remain identical, but each year those column names start at a different row and column within the sheet.
Each year I manually search for the starting row and column, but it's tedious work given the number of years of reports to wade through.
So I'd like something like this:
...
df = pd.read_excel('test.xlsx')
start_row,start_col = df.find_columns('Year','A1','B1')
...
Thanks.
Let's say you have three .xlsx files on your desktop, prefixed with Yearly_Report, that look like this when combined in Python into one dataframe with something like df = pd.concat([pd.read_excel(f, header=None) for f in yearly_files]):
0 1 2 3 4 5 6 7 8 9 10
0 A B C NaN NaN NaN NaN NaN NaN NaN NaN
1 1 2 3 NaN NaN NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN A B C NaN NaN NaN NaN NaN NaN
4 NaN NaN 4 5 6 NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN A B C
2 NaN NaN NaN NaN NaN NaN NaN NaN 4 5 6
As you can see, the column headers and values are scattered across various columns and rows. The following steps get the desired result. First, pd.concat the files and .dropna the all-NaN rows. Then transpose the dataframe with .T and drop the NaN cells from each column. Next, revert the dataframe back with another transpose .T. Finally, name the columns and drop the rows that merely repeat the column headers.
import glob, os
import pandas as pd
main_folder = 'Desktop/'
yearly_files = glob.glob(f'{main_folder}Yearly_Report*.xlsx')
df = (pd.concat([pd.read_excel(f, header=None) for f in yearly_files])
        .dropna(how='all').T
        .apply(lambda x: pd.Series(x.dropna().values)).T)
df.columns = ['A','B','C']
df = df[df['A'] != 'A']
df
output:
A B C
1 1 2 3
4 4 5 6
2 4 5 6
Something like this, though I'm not totally sure what you are looking for:
df = pd.read_excel('test.xlsx')
for i in df.index:
    print(df.loc[i, 'Year'])
    print(df.loc[i, 'A1'])
    print(df.loc[i, 'B1'])
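If you want something closer to the find_columns call sketched in the question, note that no such method exists in pandas. Here is a minimal, hypothetical helper (the function name and the labels are assumptions) that scans a headerless sheet for the cell where the known column labels start:
import pandas as pd

def find_columns(df, labels=('Year', 'A1', 'B1')):
    # Scan every cell; return (row, col) of the first place where
    # the labels appear side by side in a single row.
    n = len(labels)
    for r in range(df.shape[0]):
        for c in range(df.shape[1] - n + 1):
            if [str(v) for v in df.iloc[r, c:c + n]] == list(labels):
                return r, c
    return None

df = pd.read_excel('test.xlsx', header=None)
start_row, start_col = find_columns(df)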