Append rows to dataframe, add new columns if not exist - python

I have a df like below:
>>df
group sub_group max
0 A 1 30.0
1 B 1 300.0
2 B 2 3.0
3 A 2 2.0
I need to have group and sub_group as attributes (columns) and max as a row.
So I do
>>> df.set_index(['group','sub_group']).T
group A B A
sub_group 1 1 2 2
max 30.0 300.0 3.0 2.0
This gives me my intended formatting.
Now I need to merge it with another similar dataframe, say
>>df2
group sub_group max
0 C 1 3000.0
1 A 1 4000.0
Such that my merge results in
group A B A C
sub_group 1 1 2 2 1
max 30.0 300.0 3.0 2.0 NaN
max 4000.0 NaN NaN NaN 3000.0
Basically, at every new df we place the values under the appropriate heading; if there is a new group or sub_group we add it to the larger df. I am not sure if my way of transposing and then trying to merge/append is a good approach.
Since these dfs are generated in a loop (the loop items being dates), I would like a way to replace the max label printed in the first column (of the expected output) with the loop date.
dates = ['20170525', '20170623', '20170726']
for date in dates:
    df = pd.read_csv()

I think you can first pass the parameter index_col to read_csv to create a MultiIndex from the first and second columns:
dfs = []
for date in dates:
    df = pd.read_csv('name', index_col=[0,1])
    dfs.append(df)
#another test df was added
print (df3)
max
group sub_group
D 1 3000.0
E 1 4000.0
Then concatenate them together with the keys parameter set to the list of dates, and reshape with unstack and transpose:
#dfs = [df,df2,df3]
dates=['20170525', '20170623', '20170726']
df = pd.concat(dfs, keys=dates)['max'].unstack(0).T
print (df)
group A B C D E
sub_group 1 2 1 2 1 1 1
20170525 30.0 2.0 300.0 3.0 NaN NaN NaN
20170623 4000.0 NaN NaN NaN 3000.0 NaN NaN
20170726 NaN NaN NaN NaN NaN 3000.0 4000.0
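For reference, a minimal self-contained sketch of the same pipeline, building the three frames directly from the values shown above instead of reading CSVs (the date keys are assumed to line up with the order of the frames):
import pandas as pd

# sample frames taken from the question / answer above
df = pd.DataFrame({'group': ['A', 'B', 'B', 'A'], 'sub_group': [1, 1, 2, 2],
                   'max': [30.0, 300.0, 3.0, 2.0]}).set_index(['group', 'sub_group'])
df2 = pd.DataFrame({'group': ['C', 'A'], 'sub_group': [1, 1],
                    'max': [3000.0, 4000.0]}).set_index(['group', 'sub_group'])
df3 = pd.DataFrame({'group': ['D', 'E'], 'sub_group': [1, 1],
                    'max': [3000.0, 4000.0]}).set_index(['group', 'sub_group'])

dates = ['20170525', '20170623', '20170726']
# each date becomes a row, each (group, sub_group) pair a column, NaN where missing
out = pd.concat([df, df2, df3], keys=dates)['max'].unstack(0).T
print(out)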

Related

collapse similarly prefixed columns in pandas dataframe, convert into row_index

In short, I just want each unique value of the "ts_" prefixed columns converted into a row index. I intend to use the 'ts' and 'id' columns as a multi-index.
rows = [{'id': 1, 'a_ts': '2020-10-02', 'a_energy': 6, 'a_money': 2, 'b_ts': '2020-10-02', 'b_color': 'blue'},
        {'id': 2, 'a_ts': '2020-02-02', 'a_energy': 2, 'a_money': 5, 'a_color': 'orange', 'b_ts': '2012-08-11', 'b_money': 10, 'b_color': 'blue'},
        {'id': 3, 'a_ts': '2011-02-02', 'a_energy': 4}]
df = pd.DataFrame(rows)
id a_ts a_energy a_money b_ts b_color a_color b_money
0 1 2020-10-02 6 2.0 2020-10-02 blue NaN NaN
1 2 2020-02-02 2 5.0 2012-08-11 blue orange 10.0
2 3 2011-02-02 4 NaN NaN NaN NaN NaN
I want my output to look something like this.
energy money color
id ts
1 2020-10-02 6.0 2.0 blue
2 2020-02-02 2.0 5.0 orange
2012-08-11 NaN 10.0 blue
3 2011-02-02 4.0 NaN NaN
The best I could come up with was splitting the columns on an underscore and resetting the indexes, but that creates rows where the ids and timestamps are NaN.
I cannot simply create rows with NaNs and then get rid of all of them, as I'll lose information about which IDs did not contain a timestamp or which timestamps did not have a matched id (the dataframes are the result of a join).
df.columns = df.columns.str.split("ts_", expand=True)
df = df.stack().reset_index(drop=True)
Use:
df = df.set_index(['id'])
df.columns = df.columns.str.split("_", expand=True)
df = df.stack(0).reset_index(level=-1,drop=True).reset_index()
print (df)
id color energy money ts
0 1 NaN 6.0 2.0 2020-10-02
1 1 blue NaN NaN 2020-10-02
2 2 orange 2.0 5.0 2020-02-02
3 2 blue NaN 10.0 2012-08-11
4 3 NaN 4.0 NaN 2011-02-02
Then compact the values within each (id, ts) group, dropping the leftover all-NaN rows, with a custom lambda function:
f = lambda x: x.apply(lambda y: pd.Series(y.dropna().tolist()))
df = df.set_index(['id','ts']).groupby(['id','ts']).apply(f).droplevel(-1)
print (df)
color energy money
id ts
1 2020-10-02 blue 6.0 2.0
2 2012-08-11 blue NaN 10.0
2020-02-02 orange 2.0 5.0
3 2011-02-02 NaN 4.0 NaN
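If the same transform has to run on several frames, the two steps above can be wrapped into one helper; a sketch reusing the rows list from the question (reshape_prefixed is a hypothetical name, not from the original answer):
import pandas as pd

def reshape_prefixed(frame):
    # split 'a_ts'-style columns into a (prefix, field) MultiIndex,
    # stack the prefix level away, then compact non-NaN values per (id, ts)
    out = frame.set_index(['id'])
    out.columns = out.columns.str.split('_', expand=True)
    out = out.stack(0).reset_index(level=-1, drop=True).reset_index()
    f = lambda x: x.apply(lambda y: pd.Series(y.dropna().tolist()))
    return out.set_index(['id', 'ts']).groupby(['id', 'ts']).apply(f).droplevel(-1)

print(reshape_prefixed(pd.DataFrame(rows)))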

Merge a list of dataframes by one column with reduce function

I have edited this post with the specific case:
I have a list of dataframes like this (note that df1 and df2 share a date)
df1
index        Date  A
    0  2010-06-19  4
    1  2010-06-20  3
    2  2010-06-21  2
    3  2010-06-22  1
    4  2012-07-19  5

df2
index        Date  B
    0  2012-07-19  5
    1  2012-07-20  6

df3
index        Date  C
    0  2020-06-19  5
    1  2020-06-20  2
    2  2020-06-21  9
df_list = [df1, df2, df3]
I would like to merge all the dataframes into a single dataframe, without losing rows, placing NaN where there is nothing to merge. The criterion is merging them on the 'Date' column (the result should contain all the dates of all the merged dataframes, ordered by date).
The resulting dataframe should look like this:
Resulting dataframe:
index        Date    A    B    C
    0  2010-06-19    4  NaN  NaN
    1  2010-06-20    3  NaN  NaN
    2  2010-06-21    2  NaN  NaN
    3  2010-06-22    1  NaN  NaN
    4  2012-07-19    5    5  NaN
    5  2012-07-20  NaN    6  NaN
    6  2020-06-19  NaN  NaN    5
    7  2020-06-20  NaN  NaN    2
    8  2020-06-21  NaN  NaN    9
I tried something like this:
from functools import reduce
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Date'], how='outer'), df_list)
BUT the resulting dataframe is not as expected (I am missing some columns and it is not ordered by date). I think I am missing something.
Thank you very much
Use pandas.concat(). It takes a list of dataframes and stacks them vertically, aligning common columns and filling missing columns with NaN as necessary:
new_df = pd.concat([df1, df2, df3])
Output:
>>> new_df
index Date A B C
0 0 2010-06-19 4.0 NaN NaN
1 1 2010-06-20 3.0 NaN NaN
2 2 2010-06-21 2.0 NaN NaN
3 3 2010-06-22 1.0 NaN NaN
0 0 2012-07-19 NaN 5.0 NaN
1 1 2012-07-20 NaN 6.0 NaN
0 0 2020-06-19 NaN NaN 5.0
1 1 2020-06-20 NaN NaN 2.0
2 2 2020-06-21 NaN NaN 9.0
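If the result should also be ordered by date with a fresh 0..n row index, like the expected frame above, a follow-up sort and reset does it (a small sketch; ISO-formatted date strings sort chronologically):
new_df = (pd.concat([df1, df2, df3])
            .sort_values('Date')
            .reset_index(drop=True))
print(new_df)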
For overlapping data, I had to add the option sort=True to the merge inside the lambda function. It seemed I was missing the ordering for big dataframes and was only seeing the NaNs at the end and start of the frames. Thank you all ;-)
from functools import reduce
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Date'],
                                                how='outer', sort=True), df_list)
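As a quick check of the reduce-based version, the three frames can be rebuilt from the tables above (a sketch; Date is kept as strings and the default RangeIndex is used):
import pandas as pd
from functools import reduce

df1 = pd.DataFrame({'Date': ['2010-06-19', '2010-06-20', '2010-06-21', '2010-06-22', '2012-07-19'],
                    'A': [4, 3, 2, 1, 5]})
df2 = pd.DataFrame({'Date': ['2012-07-19', '2012-07-20'], 'B': [5, 6]})
df3 = pd.DataFrame({'Date': ['2020-06-19', '2020-06-20', '2020-06-21'], 'C': [5, 2, 9]})
df_list = [df1, df2, df3]

# outer merge on Date with sort=True keeps every date, ordered, with NaN gaps
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Date'],
                                                how='outer', sort=True), df_list)
print(df_merged)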

How to combine multiple rows in a pandas dataframe which have only 1 non-null entry per column into one row?

I am using json_normalize to parse the JSON entries of a pandas column, but as output I am getting a dataframe with multiple rows, each having only one non-null entry. I want to combine all these rows into one row in pandas.
currency custom.gt custom.eq price.gt price.lt
0 NaN 4.0 NaN NaN NaN
1 NaN NaN NaN 999.0 NaN
2 NaN NaN NaN NaN 199000.0
3 NaN NaN other NaN NaN
4 USD NaN NaN NaN NaN
You can use ffill (forward fill) and bfill (backfill), which are methods for filling NA values in pandas.
# fill NA values
# option 1:
df = df.ffill().bfill()
# option 2:
df = df.fillna(method='ffill').fillna(method='bfill')
print(df)
currency custom.gt custom.eq price.gt price.lt
0 USD 4.0 other 999.0 199000.0
1 USD 4.0 other 999.0 199000.0
2 USD 4.0 other 999.0 199000.0
3 USD 4.0 other 999.0 199000.0
4 USD 4.0 other 999.0 199000.0
You can then drop the duplicated rows using drop_duplicates and keep the first one:
df = df.drop_duplicates(keep='first')
print(df)
currency custom.gt custom.eq price.gt price.lt
0 USD 4.0 other 999.0 199000.0
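If every column holds at most one non-null value, as in this example, a shorter route is to backfill once and keep only the first row; a sketch of that alternative:
# bfill pulls each column's single non-null value up to row 0,
# so the first row is already the fully collapsed record
collapsed = df.bfill().iloc[[0]]
print(collapsed)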
Depending on how many times you have to repeat the task, I might also take a look at how the JSON file is structured to see if using a dictionary comprehension could help clean things up so that json_normalize can parse it more easily the first time.
You could do:
import pandas as pd
from functools import reduce

df = pd.DataFrame.from_dict({"a": ["1", None, None], "b": [None, None, 1], "c": [None, 3, None]})

def red_func(x, y):
    # keep the running value unless it is missing; otherwise take the next one
    if pd.isna(x):
        return y
    return x

result = [*map(lambda row: reduce(red_func, row), [list(row) for i, row in df.iterrows()])]
Outputs:
In [135]: df
Out[135]:
a b c
0 1 NaN NaN
1 None NaN 3.0
2 None 1.0 NaN
In [136]: [*map(lambda row: reduce(red_func, row), [list(row) for i, row in df.iterrows()])]
Out[136]: ['1', 3.0, 1.0]

Find a first non NaN value in Pandas

I have a Pandas dataframe such that
|user_id|value|No|
|:-:|:-:|:-:|
|id1|100|1|
|id1|200|2|
|id1|250|3|
|id2|NaN|1|
|id2|100|2|
|id3|400|1|
|id3|NaN|2|
|id3|200|3|
|id4|NaN|1|
|id4|NaN|2|
|id4|300|3|
Then I want the following dataset:
|user_id|value|No|NewNo|
|:-:|:-:|:-:|:-:|
|id1|100|1|1|
|id1|200|2|2|
|id1|250|3|3|
|id2|100|2|1|
|id3|400|1|1|
|id3|NaN|2|2|
|id3|200|3|3|
|id4|300|3|1|
Namely, I want to delete the leading NaN values so that the first value for each user_id is not NaN. Thank you.
You can groupby & forward fill the value column. Null values in the transformed data mark the rows that are null from the start of each group; filter those rows out:
df2 = df[df.groupby('user_id').value.ffill().apply(pd.notnull)].copy()
# application of copy here creates a new data frame and allows us to assign
# values to the result (df2). This is needed to create the column `NewNo`
# in the next & final step
# df2 outputs:
user_id value No
0 'id1' 100.0 1
1 'id1' 200.0 2
2 'id1' 250.0 3
4 'id2' 100.0 2
5 'id3' 400.0 1
6 'id3' NaN 2
7 'id3' 200.0 3
10 'id4' 300.0 3
Generate NewNo column using ranking within the group.
df2['NewNo'] = df2.groupby('user_id').No.rank()
# df2 outputs:
user_id value No NewNo
0 'id1' 100.0 1 1.0
1 'id1' 200.0 2 2.0
2 'id1' 250.0 3 3.0
4 'id2' 100.0 2 1.0
5 'id3' 400.0 1 1.0
6 'id3' NaN 2 2.0
7 'id3' 200.0 3 3.0
10 'id4' 300.0 3 1.0
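One small wrinkle: rank() returns floats (1.0, 2.0, ...), so if you want integer labels like the expected output in the question, cast the result:
df2['NewNo'] = df2.groupby('user_id').No.rank().astype(int)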
groupby + first_valid_index + cumcount
You can calculate indices for first non-null values by group, then use Boolean indexing:
# use transform to align groupwise first_valid_index with dataframe
firsts = df.groupby('user_id')['value'].transform(pd.Series.first_valid_index)
# apply Boolean filter (copy so the new column can be added without a SettingWithCopyWarning)
res = df[df.index >= firsts].copy()
# use groupby + cumcount to add groupwise labels
res['NewNo'] = res.groupby('user_id').cumcount() + 1
print(res)
user_id value No NewNo
0 id1 100.0 1 1
1 id1 200.0 2 2
2 id1 250.0 3 3
4 id2 100.0 2 1
5 id3 400.0 1 1
6 id3 NaN 2 2
7 id3 200.0 3 3
10 id4 300.0 3 1
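For completeness, a sketch that rebuilds the sample frame from the question's table so either answer above can be run as-is:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_id': ['id1', 'id1', 'id1', 'id2', 'id2',
                'id3', 'id3', 'id3', 'id4', 'id4', 'id4'],
    'value': [100, 200, 250, np.nan, 100,
              400, np.nan, 200, np.nan, np.nan, 300],
    'No': [1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3],
})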

Get pandas headers when rows are NaN

I get data from sensors, and for certain periods they return blank strings for no reason!
During data cleaning I can manage to get the rows containing NaN using this:
df[df.isnull().values.any(axis=1)]
Time IL1 IL2 IL3 IN kVA kW kWh
12463 2018-09-17 10:30:00 63.7 78.4 53.3 25.2 NaN NaN 2039676.0
12464 2018-09-17 11:00:00 64.1 78.6 53.5 25.4 NaN NaN 2039698.0
How can I get kVA and kW out of the DataFrame?
Then I can find the median of kVA and kW from the other rows and replace the NaNs with it.
My use case: right now I have to read the file and find where the NaN columns are by hand, which takes effort, so I want to automate that process instead of hardcoding the column names.
trdb_a2_2018_df = pd.read_csv(PATH + 'dpm_trdb_a2_2018.csv', thousands=',', parse_dates=['Time'], date_parser=extract_dt)
trdb_a2_2018_df = trdb_a2_2018_df.replace(r'\s+', np.nan, regex=True)
median_kVA = trdb_a2_2018_df['kVA'].median()
# assign back to the column; otherwise the whole frame is replaced by a Series
trdb_a2_2018_df['kVA'] = trdb_a2_2018_df['kVA'].fillna(median_kVA)
I believe you need fillna with median:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, np.nan],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7.0 1 5 a
1 b 5.0 NaN 3 3 a
2 c 4.0 9.0 5 6 a
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f NaN 3.0 0 4 b
df1 = df.fillna(df.median())
print (df1)
A B C D E F
0 a 4.0 7.0 1 5 a
1 b 5.0 4.0 3 3 a
2 c 4.0 9.0 5 6 a
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f 5.0 3.0 0 4 b
If you also want to filter to only the columns that contain NaNs:
m = df.isnull().any()
df.loc[:, m] = df.loc[:, m].fillna(df.loc[:, m].median())
Alternative:
cols = df.columns[df.isnull().any()]
df[cols] = df[cols].fillna(df[cols].median())
Detail:
print (df.median())
B 5.0
C 4.0
D 2.0
E 4.5
dtype: float64
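Note that on recent pandas versions (2.0 and later) DataFrame.median() raises on non-numeric columns such as A and F here, so you may need to restrict it to numeric columns; a small adjustment of the line above:
df1 = df.fillna(df.median(numeric_only=True))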
IIUC, to get the column headers that contain NaNs, use:
df.columns[df.isna().any()]
There are two ways for you to solve this question.
Use pandas.DataFrame.fillna to replace the NaN values with a certain value such as 0.
Use pandas.DataFrame.dropna to get a new DataFrame by filtering the original DataFrame.
Reference:
Pandas dropna API
Pandas fillna API
Let's assume that this is an initial df:
df = pd.DataFrame([{'kVa': np.nan, 'kW':10.1}, {'kVa': 12.5, 'kW':14.3}, {'kVa': 16.1, 'kW':np.nan}])
In [51]: df
Out[51]:
kVa kW
0 NaN 10.1
1 12.5 14.3
2 16.1 NaN
You can use the DataFrame's .fillna() method to replace NaNs and .notna() to get the indexes of values other than NaN:
df.kVa.fillna(df.kVa[df.kVa.notna()].median(), inplace=True)
df.kW.fillna(df.kW[df.kW.notna()].median(), inplace=True)
Use inplace=True to avoid creating a new Series instance.
Df after these manipulations:
In [54]: df
Out[54]:
kVa kW
0 14.3 10.1
1 12.5 14.3
2 16.1 12.2
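A caveat if you are on a newer pandas with copy-on-write behaviour: calling fillna with inplace=True on the column accessor (df.kVa...) may not write back to the frame. Plain assignment is the safer spelling, and since Series.median() already skips NaN the notna() mask is not needed:
df['kVa'] = df['kVa'].fillna(df['kVa'].median())
df['kW'] = df['kW'].fillna(df['kW'].median())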
