Dataframe from a dict of lists of dicts? - python

I have a dict of lists of dicts. What is the most efficient way to convert this into a DataFrame in pandas?
data = {
"0a2":[{"a":1,"b":1},{"a":1,"b":1,"c":1},{"a":1,"b":1}],
"279":[{"a":1,"b":1,"c":1},{"a":1,"b":1,"d":1}],
"ae2":[{"a":1,"b":1},{"a":1,"d":1},{"a":1,"b":1},{"a":1,"d":1}],
#...
}
import pandas as pd
pd.DataFrame(data, columns=["a","b","c","d"])
What I've tried:
One solution is to denormalize the data like this, by duplicating the "id" keys:
bad_data = [
{"a":1,"b":1,"id":"0a2"},{"a":1,"b":1,"c":1,"id":"0a2"},{"a":1,"b":1,"id":"0a2"},
{"a":1,"b":1,"c":1,"id":"279"},{"a":1,"b":1,"d":1,"id":"279"},
{"a":1,"b":1,"id":"ae2"},{"a":1,"d":1,"id":"ae2"},{"a":1,"b":1,"id":"ae2"},{"a":1,"d":1,"id":"ae2"}
]
pd.DataFrame(bad_data, columns=["a","b","c","d","id"])
But my data is very large, so I'd prefer some other hierarchical index solution.

IIUC, you can do (recommended):
new_df = pd.concat((pd.DataFrame(d) for d in data.values()), keys=data.keys())
Output:
a b c d
0a2 0 1 1.0 NaN NaN
1 1 1.0 1.0 NaN
2 1 1.0 NaN NaN
279 0 1 1.0 1.0 NaN
1 1 1.0 NaN 1.0
ae2 0 1 1.0 NaN NaN
1 1 NaN NaN 1.0
2 1 1.0 NaN NaN
3 1 NaN NaN 1.0
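Since the dict keys become the outer level of the resulting MultiIndex, all rows for a single id can then be pulled out with .loc — a quick sketch using a trimmed-down version of the data above:

```python
import pandas as pd

data = {
    "0a2": [{"a": 1, "b": 1}, {"a": 1, "b": 1, "c": 1}],
    "279": [{"a": 1, "b": 1, "c": 1}, {"a": 1, "b": 1, "d": 1}],
}

# keys= labels each per-id frame, producing an (id, row) MultiIndex
new_df = pd.concat((pd.DataFrame(d) for d in data.values()), keys=data.keys())

# select all rows belonging to one id
print(new_df.loc["279"])
```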
Or
pd.concat(pd.DataFrame(v).assign(ID=k) for k,v in data.items())
Output:
a b c ID d
0 1 1.0 NaN 0a2 NaN
1 1 1.0 1.0 0a2 NaN
2 1 1.0 NaN 0a2 NaN
0 1 1.0 1.0 279 NaN
1 1 1.0 NaN 279 1.0
0 1 1.0 NaN ae2 NaN
1 1 NaN NaN ae2 1.0
2 1 1.0 NaN ae2 NaN
3 1 NaN NaN ae2 1.0
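If you start from this second form, the same hierarchical shape can be recovered by moving the ID column into the index — a sketch (the set_index/swaplevel step is my addition, not part of the original answer):

```python
import pandas as pd

data = {
    "0a2": [{"a": 1, "b": 1}, {"a": 1, "b": 1, "c": 1}],
    "279": [{"a": 1, "b": 1, "c": 1}, {"a": 1, "b": 1, "d": 1}],
}

flat = pd.concat(pd.DataFrame(v).assign(ID=k) for k, v in data.items())

# append ID to the index, then swap it to the outer level
hier = flat.set_index("ID", append=True).swaplevel()
print(hier)
```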

Related

Merging dataframes in pandas while ignoring NaN values

Suppose I have two dataframes:
df_a
A B C
0 1.0 NaN NaN
1 NaN 1.0 NaN
2 NaN NaN 1.0
df_b
A B C
0 NaN NaN 2.0
1 NaN 2.0 NaN
2 2.0 NaN NaN
How would I go about merging/concatenating them so the result dataframe looks like this:
df_c
A B C
0 1.0 NaN 2.0
1 NaN 2.0 NaN
2 2.0 NaN 1.0
The closest I got conceptually was pd.merge(df_a, df_b, "right"), but all of df_a's values ended up replaced.
Is there any way to ignore NaN values when merging?
In your case, use combine_first:
df_c = df_b.combine_first(df_a)
df_c
Out[151]:
A B C
0 1.0 NaN 2.0
1 NaN 2.0 NaN
2 2.0 NaN 1.0
Note that the argument order matters: combine_first keeps the caller's values and only fills its NaNs from the argument, so reversing it keeps df_a's 1.0 in column B:
df_c = df_a.combine_first(df_b)
df_c
A B C
0 1.0 NaN 2.0
1 NaN 1.0 NaN
2 2.0 NaN 1.0
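When the row and column labels already line up, DataFrame.fillna also accepts a DataFrame and gives the same result here — a minimal sketch (this alternative is my addition, not part of the answer above; the equivalence only holds for aligned labels):

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"A": [1.0, np.nan, np.nan],
                     "B": [np.nan, 1.0, np.nan],
                     "C": [np.nan, np.nan, 1.0]})
df_b = pd.DataFrame({"A": [np.nan, np.nan, 2.0],
                     "B": [np.nan, 2.0, np.nan],
                     "C": [2.0, np.nan, np.nan]})

# NaNs in df_b are filled from df_a at the same label
df_c = df_b.fillna(df_a)
print(df_c)
```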

How to reset cumulative sum every time there is a NaN in a pandas dataframe?

If I have a Pandas data frame like this:
1 2 3 4 5 6 7
1 NaN 1 1 1 NaN 1 1
2 NaN NaN 1 1 1 1 1
3 NaN NaN NaN 1 NaN 1 1
4 1 1 NaN NaN 1 1 NaN
How do I do a cumulative sum such that the count resets every time there is a NaN value in the row? Such that I get something like this:
1 2 3 4 5 6 7
1 NaN 1 2 3 NaN 1 2
2 NaN NaN 1 2 3 4 5
3 NaN NaN NaN 1 NaN 1 2
4 1 2 NaN NaN 1 2 NaN
You could do:
import numpy as np
import pandas as pd

# boolean mask of the NaN positions
mask = df.isna()
# running total across each row; forward-fill over the gaps
cumulative = df.cumsum(axis=1).ffill(axis=1).fillna(0)
# running total captured at each NaN, carried forward as the reset baseline
restart = cumulative[mask].ffill(axis=1).fillna(0)
# subtract the baseline and restore the NaNs
result = cumulative - restart
result[mask] = np.nan
print(result)
Output
1 2 3 4 5 6 7
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
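For comparison, the same reset-on-NaN behaviour can be written as a per-row groupby keyed on the running NaN count — a compact alternative sketch (my addition, not one of the original answers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, 1, 1, 1, np.nan, 1, 1],
    [np.nan, np.nan, 1, 1, 1, 1, 1],
    [np.nan, np.nan, np.nan, 1, np.nan, 1, 1],
    [1, 1, np.nan, np.nan, 1, 1, np.nan],
])

# each NaN bumps the group id, so the cumsum restarts after every NaN;
# the NaN cells themselves stay NaN in the grouped cumsum
out = df.apply(lambda row: row.groupby(row.isna().cumsum()).cumsum(), axis=1)
print(out)
```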
You can do it with stack and unstack (note that the count runs across row boundaries, which is why row 4 starts at 3 below):
s = df.stack(dropna=False).isnull().cumsum()
df = df.where(df.isnull(), s.groupby(s).cumcount().unstack())
df
Out[86]:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1 2.0
2 NaN NaN 1.0 2.0 3.0 4 5.0
3 NaN NaN NaN 1.0 NaN 1 2.0
4 3.0 4.0 NaN NaN 1.0 2 NaN
I came up with a slightly different answer here that might be helpful.
For a single series, I made this function to do the cumsum-reset on nulls.
def cumsum_reset_on_null(srs: pd.Series) -> pd.Series:
    """
    For a pandas series with null values,
    do a cumsum and reset the cumulative sum when a null value is encountered.
    Example)
    input:  [1, 1, np.nan, 1, 2, 3, np.nan, 1, np.nan]
    return: [1, 2, np.nan, 1, 3, 6, np.nan, 1, np.nan]
    """
    cumulative = srs.cumsum().ffill()
    restart = ((cumulative * srs.isnull()).replace(0.0, np.nan)
               .ffill().fillna(0))
    result = cumulative - restart
    # resets (value 0) are mapped back to NaN
    return result.replace(0, np.nan)
Then for the full dataframe, just apply this function row-wise:
df = pd.DataFrame([
    [np.nan, 1, 1, 1, np.nan, 1, 1],
    [np.nan, np.nan, 1, 1, 1, 1, 1],
    [np.nan, np.nan, np.nan, 1, np.nan, 1, 1],
    [1, 1, np.nan, np.nan, 1, 1, np.nan],
])
df.apply(cumsum_reset_on_null, axis=1)
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
One way can be:
sample = pd.DataFrame(
    {1: [np.nan, np.nan, np.nan, 1], 2: [1, np.nan, np.nan, 1],
     3: [1, 1, np.nan, np.nan], 4: [1, 1, 1, np.nan],
     5: [np.nan, 1, np.nan, 1], 6: [1, 1, 1, 1],
     7: [1, 1, 1, np.nan]},
    index=[1, 2, 3, 4])
Output of sample
1 2 3 4 5 6 7
1 NaN 1.0 1.0 1.0 NaN 1 1.0
2 NaN NaN 1.0 1.0 1.0 1 1.0
3 NaN NaN NaN 1.0 NaN 1 1.0
4 1.0 1.0 NaN NaN 1.0 1 NaN
The following code would do it:
# numr = number of rows
# numc = number of columns
numr, numc = sample.shape
for i in range(numr):
    s = 0
    flag = 0
    for j in range(numc):
        if np.isnan(sample.iloc[i, j]):
            flag = 1
        else:
            if flag == 1:
                s = sample.iloc[i, j]
                flag = 0
            else:
                s += sample.iloc[i, j]
            sample.iloc[i, j] = s
Output:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1.0 2.0
2 NaN NaN 1.0 2.0 3.0 4.0 5.0
3 NaN NaN NaN 1.0 NaN 1.0 2.0
4 1.0 2.0 NaN NaN 1.0 2.0 NaN

How to assign numerical value to each new grouping in a pandas data frame row?

If I have a Pandas Data frame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 1 1
2 1 NaN NaN 1 NaN 1
3 NaN 1 1 NaN 1 1
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
How do I count each group of ones and assign a value based on the number of groups in each row? Such that I get a data frame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 2 2
2 1 NaN NaN 2 NaN 3
3 NaN 1 1 NaN 2 2
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
It is a little bit hard to find a simple way:
s = df.isnull().cumsum(1)  # the running count of nulls marks the group boundaries
s = s[df.notnull()].apply(lambda x: pd.factorize(x)[0], axis=1) + 1  # factorize assigns a dense group key per row; NaN cells get -1, i.e. 0 after the +1
df = s.mask(s == 0)  # mask the 0s (cells that were NaN) back to NaN
df
0 1 2 3 4 5
1 NaN NaN 1.0 NaN 2.0 2.0
2 1.0 NaN NaN 2.0 NaN 3.0
3 NaN 1.0 1.0 NaN 2.0 2.0
4 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN NaN NaN NaN NaN NaN
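The same idea can also be written row by row with a dense rank over the running NaN count, which avoids factorize — a sketch (my addition, not part of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, np.nan, 1, np.nan, 1, 1],
    [1, np.nan, np.nan, 1, np.nan, 1],
    [np.nan, 1, 1, np.nan, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [np.nan] * 6,
], index=[1, 2, 3, 4, 5])

# the running NaN count is constant within each run of 1s; a dense rank
# turns those constants into 1, 2, 3, ... per row, and NaN cells stay NaN
out = df.apply(lambda r: r.isna().cumsum().where(r.notna()).rank(method="dense"),
               axis=1)
print(out)
```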

Pandas: Drop Rows, Columns If More Than Half Are NaN

I have a Pandas DataFrame called df with 1,460 rows and 81 columns. I want to remove all columns where at least half the entries are NaN and to do something similar for rows.
From the Pandas docs, I attempted this:
train_df.shape  # (1460, 81)
train_df.dropna(thresh=len(train_df)/2, axis=1, inplace=True)
train_df.shape  # (1460, 77)
Is this the correct way of doing it? It seems to remove 4 columns but I'm surprised. I would have thought len(train_df) gets me the number of rows so I've passed the wrong value to thresh...?
How would I do the same thing for rows (removing rows where at least half the columns are NaN)?
Thanks!
len(train_df) and len(train_df.index) are equivalent — both give the number of rows — so your call is doing what you intended: thresh=len(train_df)/2 keeps only the columns that have at least 730 non-NaN values, i.e. it drops the columns in which more than half the entries are NaN.
train_df.dropna(thresh=len(train_df.index)/2, axis=1, inplace=True)
Hope that helps.
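For the row half of the question, the same pattern works with axis=0 and the column count as the threshold — a small sketch (the toy frame here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0],
                   "b": [1.0, np.nan, 3.0, 4.0],
                   "c": [np.nan, np.nan, 3.0, 4.0],
                   "d": [1.0, np.nan, np.nan, 4.0]})

# keep only rows having at least half of the columns non-NaN
kept = df.dropna(thresh=df.shape[1] // 2, axis=0)
print(kept.shape)
```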
Using count and loc; count(axis=...) ignores NaNs when counting.
In [4135]: df.loc[df.count(1) > df.shape[1]/2, df.count(0) > df.shape[0]/2]
Out[4135]:
0
0 0.382991
1 0.428040
7 0.441113
Details
In [4136]: df
Out[4136]:
0 1 2 3
0 0.382991 0.658090 0.881214 0.572673
1 0.428040 0.258378 0.865269 0.173278
2 0.579953 NaN NaN NaN
3 0.117927 NaN NaN NaN
4 0.597632 NaN NaN NaN
5 0.547839 NaN NaN NaN
6 0.998631 NaN NaN NaN
7 0.441113 0.527205 0.779821 0.251350
In [4137]: df.count(1) > df.shape[1]/2
Out[4137]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 True
dtype: bool
In [4138]: df.count(0) > df.shape[0]/2
Out[4138]:
0 True
1 False
2 False
3 False
dtype: bool
Setup
np.random.seed([3,14159])
df = pd.DataFrame(np.random.choice([1, np.nan], size=(10, 10)))
df
0 1 2 3 4 5 6 7 8 9
0 1.0 1.0 NaN NaN NaN 1.0 1.0 NaN 1.0 NaN
1 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0
2 NaN 1.0 1.0 NaN NaN NaN NaN 1.0 1.0 1.0
3 1.0 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN
4 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN 1.0 NaN
5 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0
6 NaN NaN 1.0 NaN NaN 1.0 1.0 NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN 1.0 NaN 1.0 NaN NaN
8 1.0 1.0 1.0 NaN 1.0 NaN 1.0 NaN NaN 1.0
9 NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Solution 1
This assumes you make the calculation for rows and columns before you drop either rows or columns.
n = df.notnull()
df.loc[n.mean(1) > .5, n.mean() > .5]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
Solution 2
Similar concept but using numpy tools.
v = np.isnan(df.values)
r = np.count_nonzero(v, 1) < v.shape[1] // 2
c = np.count_nonzero(v, 0) < v.shape[0] // 2
df.loc[r, c]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
Try this for rows (note the threshold here is a third of the columns, not half):
df.dropna(thresh=df.shape[1] / 3, axis=0, inplace=True)

Pandas: operations with nans

Suppose I produce the following using pandas:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(25).reshape((5, 5)), index=['A', 'B', 'C', 'D', 'E'])
df1 = pd.DataFrame(np.ones(25).reshape((5, 5)) * 2, index=['A', 'B', 'C', 'D', 'E'])
df[2] = np.nan
df1[3] = np.nan
df[4] = np.nan
df1[4] = np.nan
df2 = df+df1
print(df2)
0 1 2 3 4
A 3.0 3.0 NaN NaN NaN
B 3.0 3.0 NaN NaN NaN
C 3.0 3.0 NaN NaN NaN
D 3.0 3.0 NaN NaN NaN
E 3.0 3.0 NaN NaN NaN
What do I have to do to get this instead?
0 1 2 3 4
A 3 3 2 1 NaN
B 3 3 2 1 NaN
C 3 3 2 1 NaN
D 3 3 2 1 NaN
E 3 3 2 1 NaN
Use the fill_value argument of the DataFrame.add method:
fill_value : None or float value, default None Fill missing (NaN)
values with this value. If both DataFrame locations are missing, the
result will be missing.
df.add(df1, fill_value=0)
Out:
0 1 2 3 4
A 3.0 3.0 2.0 1.0 NaN
B 3.0 3.0 2.0 1.0 NaN
C 3.0 3.0 2.0 1.0 NaN
D 3.0 3.0 2.0 1.0 NaN
E 3.0 3.0 2.0 1.0 NaN
