I was doing some coding and realized something: I think there is an easier way of doing this.
So I have a DataFrame like this:
>>> df = pd.DataFrame({'a': [1, 'A', 2, 'A'], 'b': ['A', 3, 'A', 4]})
a b
0 1 A
1 A 3
2 2 A
3 A 4
I want to remove all of the As from the data, but I also want to "squeeze" the DataFrame. By squeezing, I mean getting a result like this:
a b
0 1 3
1 2 4
I have a solution as follows:
a = df['a'][df['a'] != 'A']
b = df['b'][df['b'] != 'A']
df2 = pd.DataFrame({'a': a.tolist(), 'b': b.tolist()})
print(df2)
This works, but I suspect there is an easier way; I've stopped coding for a while, so I'm a bit rusty...
Note:
All columns contain the same number of As, so the columns stay the same length; there is no problem there.
You can try boolean indexing with loc to remove the A values:
pd.DataFrame({c: df.loc[df[c] != 'A', c].tolist() for c in df})
Result:
a b
0 1 3
1 2 4
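A related sketch (my own variation, not part of the answer above): apply can run the same filter per column and rebuild the index in one pass, relying on the question's note that every column drops the same number of As:
df.apply(lambda s: s[s != 'A'].reset_index(drop=True))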
This would do:
In [1513]: df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy()))
Out[1513]:
a b
0 1.0 3.0
1 2.0 4.0
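Note that introducing NaN upcasts the columns to float, which is why the output shows 1.0 and 3.0. If you want the integers back, a possible follow-up (my own addition, assuming all surviving values are numeric) is to append .astype(int):
df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy())).astype(int)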
We can use df.melt, then filter out the 'A' values, then df.pivot:
out = df.melt().query("value!='A'")
out.index = out.groupby('variable')['variable'].cumcount()
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Details
out = df.melt().query("value!='A'")
variable value
0 a 1
2 a 2
5 b 3
7 b 4
# We set this as index so it helps in `df.pivot`
out.groupby('variable')['variable'].cumcount()
0 0
2 1
5 0
7 1
dtype: int64
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Another alternative
df = df.mask(df.eq('A'))
out = df.stack()
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
Details
df = df.mask(df.eq('A'))
a b
0 1 NaN
1 NaN 3
2 2 NaN
3 NaN 4
out = df.stack()
0 a 1
1 b 3
2 a 2
3 b 4
dtype: object
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
Suppose a df like this (the second column stands in for any other columns):
A B ...
2 .
3 .
2 .
3 .
2 .
1 .
I expect output to be:
A B ...
2 .
2 .
2 .
3 .
3 .
1 .
Because 2 was repeated the most, then 3, and so on.
This works:
# Suppose you have a df like this:
import pandas as pd
df = pd.DataFrame({'A':[2,3,2,3,2,1], 'B':range(6)})
A B
0 2 0
1 3 1
2 2 2
3 3 3
4 2 4
5 1 5
# you can pass a sorting function to sort_values as key:
df = df.sort_values(by='A', key=lambda x: x.map(x.value_counts()), ascending=False)
A B
0 2 0
2 2 2
4 2 4
1 3 1
3 3 3
5 1 5
This would work
df['Frequency'] = df.groupby('A')['A'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)
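If you don't want the helper column in the final output, you could drop it afterwards (my own follow-up):
df = df.drop(columns='Frequency')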
Try value_counts and argsort (shown here on the sample df from the first answer):
out = df.iloc[(-df.A.value_counts().reindex(df.A)).argsort()]
Out[647]:
   A  B
0  2  0
2  2  2
4  2  4
1  3  1
3  3  3
5  1  5
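A step-by-step sketch of what that one-liner does (my own breakdown):
counts = df.A.value_counts()   # frequency of each distinct value in A
keys = counts.reindex(df.A)    # per-row frequency, aligned to each row of df
order = (-keys).argsort()      # row positions sorted by descending frequency
out = df.iloc[order]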
First add a new column counting the repetitions:
>>> df['C'] = df.groupby('A')['A'].transform('count')
Then sort by this new column:
>>> df.sort_values(['C','A'], ascending=False)
I have the following dataframe:
a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6
The desirable output is:
a,b
1,3
2,4
3,5
2,4
3,5
4,6
There are a lot of "a"- and "b"-named headers in the dataframe; the maximum is a50 and b50. So I am looking for a way to combine them all into just "a" and "b".
I think it's possible to do with concat, but I have no idea how to combine it all, stacking all the values under each other. I'll be grateful for any ideas.
You can use pd.wide_to_long:
pd.wide_to_long(df.reset_index(), ['a','b'], 'index', 'No').reset_index()[['a','b']]
Output:
a b
0 1 3
1 2 4
2 3 5
3 2 4
4 3 5
5 4 6
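Since you mentioned concat, here is a minimal concat-based sketch as well (my own addition, assuming the columns are consistently named a1..aN and b1..bN):
n = 2  # 50 in your real data
pairs = [df[[f'a{i}', f'b{i}']].set_axis(['a', 'b'], axis=1) for i in range(1, n + 1)]
out = pd.concat(pairs, ignore_index=True)
This stacks each (a_i, b_i) pair of columns under the previous one, giving the same output as above.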
First we read the dataframe:
import pandas as pd
from io import StringIO
s = """a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6"""
df = pd.read_csv(StringIO(s), sep=',')
Then we stack the columns, and separate the number of the columns from the letter 'a' or 'b':
stacked = df.stack().rename("val").reset_index(1).reset_index()
cols_numbers = pd.DataFrame(stacked
                            .level_1
                            .str.split(r'(\d)')
                            .apply(lambda l: l[:2])
                            .tolist(),
                            columns=["col", "num"])
x = cols_numbers.join(stacked[['val', 'index']])
print(x)
col num val index
0 a 1 1 0
1 a 2 2 0
2 b 1 3 0
3 b 2 4 0
4 a 1 2 1
5 a 2 3 1
6 b 1 4 1
7 b 2 5 1
8 a 1 3 2
9 a 2 4 2
10 b 1 5 2
11 b 2 6 2
Finally, we group by index and num to get the two columns a and b; within each group we backfill and keep the first row, so each (index, num) pair yields one row holding both the a and the b value:
result = (x
          .set_index("col", append=True)
          .groupby(["index", "num"])
          .val
          .apply(lambda g: g
                 .unstack()
                 .bfill()
                 .head(1))
          .reset_index(-1, drop=True))
print(result)
col a b
index num
0 1 1.0 3.0
2 2.0 4.0
1 1 2.0 4.0
2 3.0 5.0
2 1 3.0 5.0
2 4.0 6.0
To get rid of the multiindex at the end: result.reset_index(drop=True)
Say I have two columns in a data frame, one of which is incomplete.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b':[5, '', 6, '']})
df
Out:
a b
0 1 5
1 2
2 3 6
3 4
Is there a way to fill the empty values in column b with the corresponding values in column a, whilst leaving the rest of column b intact, such that you obtain the following without iterating over the column?
df
Out:
a b
0 1 5
1 2 2
2 3 6
3 4 4
I think you can use the apply method, but I am not sure. For reference, the dataset I'm dealing with is quite large (approx. 1 GB), which is why iteration (my first attempt) was not a good idea.
If the blanks are empty strings, you could:
In [165]: df.loc[df['b'] == '', 'b'] = df['a']
In [166]: df
Out[166]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
However, if your blanks are NaNs, you could use fillna
In [176]: df
Out[176]:
a b
0 1 5.0
1 2 NaN
2 3 6.0
3 4 NaN
In [177]: df['b'] = df['b'].fillna(df['a'])
In [178]: df
Out[178]:
a b
0 1 5.0
1 2 2.0
2 3 6.0
3 4 4.0
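A related sketch (my own variant, not from the answers here, assuming numpy is imported as np): convert the blanks to NaN first and let combine_first pull the missing entries from a:
df['b'] = df['b'].replace('', np.nan).combine_first(df['a'])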
You can use np.where to evaluate df.b: if it's not empty, keep its value; otherwise use df.a instead.
df.b = np.where(df.b, df.b, df.a)
df
Out[33]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use pd.Series.where with a boolean version of df.b, because '' resolves to False:
df.assign(b=df.b.where(df.b.astype(bool), df.a))
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use replace and ffill with axis=1:
df.replace('', np.nan).ffill(axis=1).astype(df.a.dtype)
Output:
a b
0 1 5
1 2 2
2 3 6
3 4 4
I am using Pandas to structure and process Data.
This is my DataFrame:
And this is the code which enabled me to get this DataFrame:
(data[['time_bucket', 'beginning_time', 'bitrate', 2, 3]].groupby(['time_bucket', 'beginning_time', 2, 3])).aggregate(np.mean)
Now I want to have the sum (ideally, the sum and the count) of my 'bitrates' grouped by the same time_bucket. For example, for the first time_bucket (2016-07-08 02:00:00, 2016-07-08 02:05:00), it should be 93750000 as the sum and 25 as the count, covering every 'bitrate' case.
I did this :
data[['time_bucket', 'bitrate']].groupby(['time_bucket']).agg(['sum', 'count'])
And this is the result:
But I really want to have all my data in one DataFrame.
Can I do a simple loop over 'time_bucket' and apply a function which calculates the sum of all bitrates?
Any ideas? Thanks!
I think you need merge, but both DataFrames need the same index levels, so use reset_index first; at the end, restore the original MultiIndex with set_index:
import numpy as np
import pandas as pd

data = pd.DataFrame({'A':[1,1,1,1,1,1],
'B':[4,4,4,5,5,5],
'C':[3,3,3,1,1,1],
'D':[1,3,1,3,1,3],
'E':[5,3,6,5,7,1]})
print (data)
A B C D E
0 1 4 3 1 5
1 1 4 3 3 3
2 1 4 3 1 6
3 1 5 1 3 5
4 1 5 1 1 7
5 1 5 1 3 1
df1 = data[['A', 'B', 'C', 'D','E']].groupby(['A', 'B', 'C', 'D']).aggregate(np.mean)
print (df1)
E
A B C D
1 4 3 1 5.5
3 3.0
5 1 1 7.0
3 3.0
df2 = data[['A', 'C']].groupby(['A'])['C'].agg(['sum', 'count'])
print (df2)
sum count
A
1 12 6
print (pd.merge(df1.reset_index(['B','C','D']), df2, left_index=True, right_index=True)
.set_index(['B','C','D'], append=True))
E sum count
A B C D
1 4 3 1 5.5 12 6
3 3.0 12 6
5 1 1 7.0 12 6
3 3.0 12 6
I tried another solution that computes the output from df1, but df1 is already aggregated, so it is impossible to get the right data: if you sum at level C, you get 8 instead of 12.
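As a side note, a transform-based sketch (my own variant, not the merge above) attaches the group statistics to the raw rows first, so no merge is needed:
data['sum'] = data.groupby('A')['C'].transform('sum')
data['count'] = data.groupby('A')['C'].transform('count')
df3 = data.groupby(['A', 'B', 'C', 'D']).agg({'E': 'mean', 'sum': 'first', 'count': 'first'})
print (df3)
This should produce the same E / sum / count columns as the merge result.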
I have this pandas dataframe:
SourceDomain 1 2 3
0 www.theguardian.com profile.theguardian.com 1 Directed
1 www.theguardian.com membership.theguardian.com 2 Directed
2 www.theguardian.com subscribe.theguardian.com 3 Directed
3 www.theguardian.com www.google.co.uk 4 Directed
4 www.theguardian.com jobs.theguardian.com 5 Directed
I would like to add a new column which is a pandas series created like this:
Weights = Weights.value_counts()
However, when I try to add the new column using edgesFile[4] = Weights it fills it with NA instead of the values:
SourceDomain 1 2 3 4
0 www.theguardian.com profile.theguardian.com 1 Directed NaN
1 www.theguardian.com membership.theguardian.com 2 Directed NaN
2 www.theguardian.com subscribe.theguardian.com 3 Directed NaN
3 www.theguardian.com www.google.co.uk 4 Directed NaN
4 www.theguardian.com jobs.theguardian.com 5 Directed NaN
How can I add the new column keeping the values?
Thanks!
Dani
You are getting NaNs because the index of Weights does not match up with the index of edgesFile. If you want Pandas to ignore Weights.index and just paste the values in order then pass the underlying NumPy array instead:
edgesFile[4] = Weights.values
Here is an example which demonstrates the difference:
In [14]: df = pd.DataFrame(np.arange(4)*10, index=list('ABCD'))
In [15]: df
Out[15]:
0
A 0
B 10
C 20
D 30
In [16]: s = pd.Series(np.arange(4), index=list('CDEF'))
In [17]: s
Out[17]:
C 0
D 1
E 2
F 3
dtype: int64
Here we see Pandas aligning the index:
In [18]: df[4] = s
In [19]: df
Out[19]:
0 4
A 0 NaN
B 10 NaN
C 20 0
D 30 1
Here, Pandas simply pastes the values in s into the column:
In [20]: df[4] = s.values
In [21]: df
Out[21]:
0 4
A 0 0
B 10 1
C 20 2
D 30 3
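Another option (my own addition, not from the answer above): instead of dropping down to the NumPy array, you can replace the Series' index with the DataFrame's, so alignment becomes a no-op:
In [22]: df[4] = s.set_axis(df.index)
This keeps the values in positional order, just like s.values, but stays within pandas.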
This is a small example based on your question.
You can add a new column with a column name to an existing DataFrame:
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['A', 'B', 'C'])
>>> df
A B C
0 1 2 3
1 4 5 6
>>> s = pd.Series([7,8,9])
>>> s
0 7
1 8
2 9
>>> df['D']=s
>>> df
A B C D
0 1 2 3 7
1 4 5 6 8
Or you can make a DataFrame from the Series and then concat:
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df
0 1 2
0 1 2 3
1 4 5 6
>>> s = pd.DataFrame(pd.Series([7,8]))  # if you don't provide a column name, the default name will be 0
>>> s
0
0 7
1 8
>>> df = pd.concat([df,s], axis=1)
>>> df
0 1 2 0
0 1 2 3 7
1 4 5 6 8
Hope this will help