How to combine dataframe rows - python

I have the following code:
import os
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

fileName = input("Enter file name here (Case Sensitive) > ")
df = pd.read_excel(fileName + '.xlsx', sheetname=None, ignore_index=True)
xl = pd.ExcelFile(fileName + '.xlsx')
SystemCount = len(xl.sheet_names)
df1 = pd.DataFrame([])
for y in range(1, int(SystemCount) + 1):
    df = pd.read_excel(xl, 'System ' + str(y))
    df['System {0}'.format(y)] = "1"
    df1 = df1.append(df)
df1 = df1.sort_values(['Email'])
df = df1['Email'].value_counts()
df1['Count'] = df1.groupby('Email')['Email'].transform('count')
print(df1)
Which prints something like this:
Email             System 1  System 2  System 3  System 4  Count
test_1_#test.com       NaN         1       NaN       NaN      1
test_2_#test.com       NaN       NaN         1       NaN      3
test_2_#test.com         1       NaN       NaN       NaN      3
test_2_#test.com       NaN       NaN       NaN         1      3
test_3_#test.com       NaN         1       NaN       NaN      1
test_4_#test.com       NaN       NaN         1       NaN      1
test_5_#test.com         1       NaN       NaN       NaN      3
test_5_#test.com       NaN       NaN         1       NaN      3
test_5_#test.com       NaN       NaN       NaN         1      3
How do I combine this, so the email only shows once, with all marked systems?
I would like the output to look like this:
                  System1  System2  System3  System4  Count
Email
test_1_#test.com      0.0      1.0      0.0      0.0      1
test_2_#test.com      1.0      0.0      1.0      1.0      3
test_3_#test.com      0.0      1.0      0.0      0.0      1
test_4_#test.com      0.0      0.0      1.0      0.0      1
test_5_#test.com      1.0      0.0      1.0      1.0      3

If I understand correctly:
df1 = df1.apply(lambda x: pd.to_numeric(x, errors='ignore'))
# aggregate every 'System' column with 'sum' and the trailing 'Count' column with 'first'
d = dict(zip(df1.columns[1:], ['sum'] * df1.columns[1:].str.contains('System').sum() + ['first']))
df1.fillna(0).groupby('Email').agg(d)
Out[95]:
                  System1  System2  System3  System4  Count
Email
test_1_#test.com      0.0      1.0      0.0      0.0      1
test_2_#test.com      1.0      0.0      1.0      1.0      3
test_3_#test.com      0.0      1.0      0.0      0.0      1
test_4_#test.com      0.0      0.0      1.0      0.0      1
test_5_#test.com      1.0      0.0      1.0      1.0      3

It'd be easier to get help if you would post code to generate your input data.
But you probably want a GroupBy:
df2 = df1.groupby('Email').sum()
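For reference, a minimal, self-contained sketch of that GroupBy approach on hypothetical sample data mirroring a few rows of the frame printed above (it assumes the System columns are numeric rather than the string "1" the original loop assigns):
import numpy as np
import pandas as pd

# hypothetical sample resembling the printed frame above
df1 = pd.DataFrame({
    'Email': ['test_1_#test.com', 'test_2_#test.com', 'test_2_#test.com', 'test_2_#test.com'],
    'System 1': [np.nan, 1, np.nan, np.nan],
    'System 2': [1, np.nan, np.nan, np.nan],
    'System 3': [np.nan, np.nan, 1, np.nan],
    'System 4': [np.nan, np.nan, np.nan, 1],
})

out = df1.groupby('Email').sum()            # NaNs are skipped; all-NaN groups become 0.0
out['Count'] = df1.groupby('Email').size()  # number of source rows per email
print(out)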

Related

How to count number of rows dropped in a pandas dataframe

How do I print the number of rows dropped while executing the following code in Python?
df.dropna(inplace = True)
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
len(new_data)
Use:
np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0,np.nan, 1], size=(10, 3)))
print (df)
0 1 2
0 NaN 0.0 NaN
1 0.0 NaN NaN
2 0.0 0.0 1.0
3 0.0 0.0 NaN
4 NaN NaN 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
7 NaN 0.0 1.0
8 1.0 1.0 NaN
9 1.0 0.0 NaN
You can count the rows with missing values beforehand using DataFrame.isna with DataFrame.any and sum:
count = df.isna().any(axis=1).sum()
df.dropna(inplace = True)
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
Or take the difference between the DataFrame's number of rows before and after dropna:
orig = df.shape[0]
df.dropna(inplace = True)
count = orig - df.shape[0]
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
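If you do this often, the second idea can be wrapped in a small helper; a sketch (the name dropna_report is hypothetical, not a pandas API):
import pandas as pd

def dropna_report(df):
    # hypothetical helper: drop rows containing any NaN and report how many were removed
    before = df.shape[0]
    cleaned = df.dropna()
    return cleaned, before - cleaned.shape[0]

new_data, n_dropped = dropna_report(df)
print(f"dropped {n_dropped} rows")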

Best way to reassemble a pandas data frame

Need to reassemble a data frame that is the result of a group by operation. It is assumed to be ordered.
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 NaN NaN 2 NaN
2 1.0 1.0 1 NaN
3 NaN NaN 2 NaN
4 NaN NaN 3 NaN
5 2.0 3.0 1 NaN
6 NaN NaN 2 2.0
And looking for something like this
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
Wondering if there is an elegant way to resolve it.
import pandas as pd
import numpy as np

def refill_frame(df, cols):
    while df[cols].isnull().values.any():
        for col in cols:
            if col in list(df):
                # print(col)
                df[col] = np.where(df[col].isnull(), df[col].shift(1), df[col])
    return df

df = pd.DataFrame({'Major': [0, None, 1, None, None, 2, None],
                   'Minor': [0, None, 1, None, None, 3, None],
                   'RelType': [1, 2, 1, 2, 3, 1, 2],
                   'SomeNulls': [1, None, None, None, None, None, 2]})
print(df)
cols2fill = ['Major', 'Minor']
df = refill_frame(df, cols2fill)
print(df)
If I understand the question correctly, you could do a forward-fill transform on just those columns:
df.loc[:, ['Major', 'Minor']] = df.loc[:, ['Major', 'Minor']].transform('ffill')
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
You could also use the fill_direction function from pyjanitor:
# pip install pyjanitor
import janitor
df.fill_direction({"Major":"down", "Minor":"down"})
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
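As a side note, the same result can be sketched with a plain ffill restricted to the two columns, with no transform or extra library, assuming df is the original, ordered frame from the question:
# forward-fill only the key columns, leaving the rest untouched
df[['Major', 'Minor']] = df[['Major', 'Minor']].ffill()
print(df)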

Dataframe from a dict of lists of dicts?

I have a dict of lists of dicts. What is the most efficient way to convert this into a DataFrame in pandas?
data = {
    "0a2": [{"a": 1, "b": 1}, {"a": 1, "b": 1, "c": 1}, {"a": 1, "b": 1}],
    "279": [{"a": 1, "b": 1, "c": 1}, {"a": 1, "b": 1, "d": 1}],
    "ae2": [{"a": 1, "b": 1}, {"a": 1, "d": 1}, {"a": 1, "b": 1}, {"a": 1, "d": 1}],
    # ...
}
import pandas as pd
pd.DataFrame(data, columns=["a", "b", "c", "d"])
What I've tried:
One solution is to denormalize the data like this, by duplicating the "id" keys:
bad_data = [
    {"a": 1, "b": 1, "id": "0a2"}, {"a": 1, "b": 1, "c": 1, "id": "0a2"}, {"a": 1, "b": 1, "id": "0a2"},
    {"a": 1, "b": 1, "c": 1, "id": "279"}, {"a": 1, "b": 1, "d": 1, "id": "279"},
    {"a": 1, "b": 1, "id": "ae2"}, {"a": 1, "d": 1, "id": "ae2"}, {"a": 1, "b": 1, "id": "ae2"}, {"a": 1, "d": 1, "id": "ae2"}
]
pd.DataFrame(bad_data, columns=["a", "b", "c", "d", "id"])
But my data is very large, so I'd prefer some other hierarchical index solution.
IIUC, you can do (recommended):
new_df = pd.concat((pd.DataFrame(d) for d in data.values()), keys=data.keys())
Output:
         a    b    c    d
0a2 0    1  1.0  NaN  NaN
    1    1  1.0  1.0  NaN
    2    1  1.0  NaN  NaN
279 0    1  1.0  1.0  NaN
    1    1  1.0  NaN  1.0
ae2 0    1  1.0  NaN  NaN
    1    1  NaN  NaN  1.0
    2    1  1.0  NaN  NaN
    3    1  NaN  NaN  1.0
Or
pd.concat(pd.DataFrame(v).assign(ID=k) for k,v in data.items())
Output:
     a    b    c   ID    d
0    1  1.0  NaN  0a2  NaN
1    1  1.0  1.0  0a2  NaN
2    1  1.0  NaN  0a2  NaN
0    1  1.0  1.0  279  NaN
1    1  1.0  NaN  279  1.0
0    1  1.0  NaN  ae2  NaN
1    1  NaN  NaN  ae2  1.0
2    1  1.0  NaN  ae2  NaN
3    1  NaN  NaN  ae2  1.0
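A quick usage note on the first (recommended) option: because the dict keys become the outer level of a MultiIndex, each original list is still addressable as its own sub-frame, e.g. using new_df from above:
# all records that came from key "0a2"
new_df.loc["0a2"]
# equivalently, a cross-section on the outer level
new_df.xs("0a2", level=0)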

How do I transform a very large dataframe to get the count of values in all columns (without using df.stack or df.apply)

I am working with a very large dataframe (~3 million rows) and I need the count of values from multiple columns, grouped by time-related data.
I have tried to stack the columns, but the resulting dataframe was very long and wouldn't fit in memory. Similarly, df.apply gave memory issues.
For example, if my sample dataframe is:
id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,
I have written the following script, which does what's needed for a small sample but fails on the large dataframe.
df.set_index(["id", "date"], inplace=True)
df = df.stack(level=[0])
df = df.groupby(level=[0,1]).value_counts()
df = df.unstack(level=[1,2])
I also have a solution via apply but it has the same complications.
The expected result is,
date 1/1/2014 1/4/2014 ... 1/5/2014 1/4/2014 1/7/2014
abc xyz qwe qwe ... xyz xyz abc qwe
id ...
1 4.0 2.0 1.0 NaN ... NaN NaN NaN NaN
2 2.0 NaN NaN 4.0 ... NaN NaN NaN NaN
3 NaN NaN NaN NaN ... 2.0 NaN NaN NaN
4 NaN NaN NaN NaN ... NaN 1.0 1.0 1.0
I am looking for a more optimized version of what I have written.
Thanks for the help !!
You don't want to use stack. Therefore, another solution is to use crosstab on id with each pair of date and field columns, then concat the results together, groupby() the index and sum. Use a listcomp on df.columns[2:] to create each crosstab (note: I assume the first 2 columns are id and date, as in your sample):
pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum()
Out[497]:
    1/1/2014                    1/2/2014  1/4/2014                    1/5/2014            1/7/2014
         abc       qwe      xyz      abc       abc      qwe      xyz       abc      xyz       abc      qwe
id
1          4       1.0      2.0      0.0       0.0      0.0      0.0       0.0      0.0       0.0      0.0
2          2       0.0      0.0      0.0       1.0      4.0      0.0       2.0      0.0       0.0      0.0
3          0       0.0      0.0      2.0       0.0      0.0      0.0       0.0      2.0       0.0      0.0
4          0       0.0      0.0      0.0       1.0      0.0      1.0       0.0      0.0       1.0      1.0
I think showing 0 is better than NaN. However, if you want NaN instead of 0, you just need to chain an additional replace as follows:
pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum().replace({0: np.nan})
Out[501]:
    1/1/2014                    1/2/2014  1/4/2014                    1/5/2014            1/7/2014
         abc       qwe      xyz      abc       abc      qwe      xyz       abc      xyz       abc      qwe
id
1        4.0       1.0      2.0      NaN       NaN      NaN      NaN       NaN      NaN       NaN      NaN
2        2.0       NaN      NaN      NaN       1.0      4.0      NaN       2.0      NaN       NaN      NaN
3        NaN       NaN      NaN      2.0       NaN      NaN      NaN       NaN      2.0       NaN      NaN
4        NaN       NaN      NaN      NaN       1.0      NaN      1.0       NaN      NaN       1.0      1.0
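For completeness, a minimal sketch that rebuilds the sample frame from the CSV text in the question and runs the crosstab line (column names exactly as given above):
import io
import pandas as pd

csv_text = """id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,"""

df = pd.read_csv(io.StringIO(csv_text))

# one crosstab per field column, stacked and summed back onto a single id index
result = (
    pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]])
    .groupby(level=0)
    .sum()
)
print(result)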

Pandas: Drop Rows, Columns If More Than Half Are NaN

I have a Pandas DataFrame called df with 1,460 rows and 81 columns. I want to remove all columns where at least half the entries are NaN and to do something similar for rows.
From the Pandas docs, I attempted this:
train_df.shape  # (1460, 81)
train_df.dropna(thresh=len(train_df)/2, axis=1, inplace=True)
train_df.shape  # (1460, 77)
Is this the correct way of doing it? It only removes 4 columns, which surprises me; I would have thought len(train_df) gives me the number of rows, so have I passed the wrong value to thresh?
How would I do the same thing for rows (removing rows where at least half the columns are NaN)?
Thanks!
I guess you did the right thing but forgot to add the .index.
The line should look like this:
train_df.dropna(thresh=len(train_df.index)/2, axis=1, inplace=True)
Hope that helps.
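For the second part of the question (dropping rows where at least half of the columns are NaN), the same thresh idea works along axis=0; a minimal sketch, assuming "half" means a row must keep at least half of its 81 columns non-NaN to survive:
import math

# require at least half of the columns to be non-NaN for a row to be kept
min_non_na = math.ceil(train_df.shape[1] / 2)
train_df.dropna(thresh=min_non_na, axis=0, inplace=True)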
Using count and loc; count(axis=...) ignores NaNs when counting.
In [4135]: df.loc[df.count(1) > df.shape[1]/2, df.count(0) > df.shape[0]/2]
Out[4135]:
0
0 0.382991
1 0.428040
7 0.441113
Details
In [4136]: df
Out[4136]:
0 1 2 3
0 0.382991 0.658090 0.881214 0.572673
1 0.428040 0.258378 0.865269 0.173278
2 0.579953 NaN NaN NaN
3 0.117927 NaN NaN NaN
4 0.597632 NaN NaN NaN
5 0.547839 NaN NaN NaN
6 0.998631 NaN NaN NaN
7 0.441113 0.527205 0.779821 0.251350
In [4137]: df.count(1) > df.shape[1]/2
Out[4137]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 True
dtype: bool
In [4138]: df.count(0) < df.shape[0]/2
Out[4138]:
0 False
1 True
2 True
3 True
dtype: bool
Setup
np.random.seed([3,14159])
df = pd.DataFrame(np.random.choice([1, np.nan], size=(10, 10)))
df
0 1 2 3 4 5 6 7 8 9
0 1.0 1.0 NaN NaN NaN 1.0 1.0 NaN 1.0 NaN
1 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0
2 NaN 1.0 1.0 NaN NaN NaN NaN 1.0 1.0 1.0
3 1.0 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN
4 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN 1.0 NaN
5 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0
6 NaN NaN 1.0 NaN NaN 1.0 1.0 NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN 1.0 NaN 1.0 NaN NaN
8 1.0 1.0 1.0 NaN 1.0 NaN 1.0 NaN NaN 1.0
9 NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Solution 1
This assumes you make the calculation for rows and columns before you drop either rows or columns.
n = df.notnull()
# n.mean(1) is the fraction of non-null values per row; n.mean() the fraction per column
df.loc[n.mean(1) > .5, n.mean() > .5]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
Solution 2
Similar concept but using numpy tools.
v = np.isnan(df.values)
r = np.count_nonzero(v, 1) < v.shape[1] // 2  # rows with fewer NaNs than half the column count
c = np.count_nonzero(v, 0) < v.shape[0] // 2  # columns with fewer NaNs than half the row count
df.loc[r, c]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
Try this code, it should do it:
df.dropna(thresh = df.shape[1]/3, axis = 0, inplace = True)
