Pandas - combine columns and put one after another? - python

I have the following dataframe:
a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6
The desirable output is:
a,b
1,3
2,4
3,5
2,4
3,5
4,6
The dataframe has many "a" and "b" columns, up to a50 and b50, so I am looking for a way to combine them all into just "a" and "b".
I think it can be done with concat, but I don't know how to stack all the values under each other. I'd be grateful for any ideas.

You can use pd.wide_to_long:
pd.wide_to_long(df.reset_index(), ['a','b'], 'index', 'No').reset_index()[['a','b']]
Output:
a b
0 1 3
1 2 4
2 3 5
3 2 4
4 3 5
5 4 6
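The concat idea from the question can also be sketched directly. This is only an illustration, assuming every column name is a single letter followed by a number and that each number has both an "a" and a "b" column; the names suffixes, pieces and combined are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a1": [1, 2, 3], "a2": [2, 3, 4],
                   "b1": [3, 4, 5], "b2": [4, 5, 6]})

# Collect the numeric suffixes found in the column names
# (assumption: one-letter prefix, then digits).
suffixes = sorted({int(c[1:]) for c in df.columns})

pieces = []
for n in suffixes:
    # Take the pair (a<n>, b<n>) and rename it to plain a, b.
    block = df[[f"a{n}", f"b{n}"]].copy()
    block.columns = ["a", "b"]
    pieces.append(block)

# Stack the renamed pairs under each other with a fresh 0..n-1 index.
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Sorting the suffixes as integers keeps a10 after a2, which a plain string sort would not.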

First we read the dataframe:
import pandas as pd
from io import StringIO
s = """a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6"""
df = pd.read_csv(StringIO(s), sep=',')
Then we stack the columns, and separate the number of the columns from the letter 'a' or 'b':
stacked = df.stack().rename("val").reset_index(1).reset_index()
cols_numbers = pd.DataFrame(stacked
.level_1
.str.split(r'(\d+)')
.apply(lambda l: l[:2])
.tolist(),
columns=["col", "num"])
x = cols_numbers.join(stacked[['val', 'index']])
print(x)
col num val index
0 a 1 1 0
1 a 2 2 0
2 b 1 3 0
3 b 2 4 0
4 a 1 2 1
5 a 2 3 1
6 b 1 4 1
7 b 2 5 1
8 a 1 3 2
9 a 2 4 2
10 b 1 5 2
11 b 2 6 2
Finally, we group by index and num to get two columns a and b, and we fill the first row of the b column with the second value, to get what was expected:
result = (x
.set_index("col", append=True)
.groupby(["index", "num"])
.val
.apply(lambda g:
g
.unstack()
.bfill()
.head(1))
.reset_index(-1, drop=True))
print(result)
col a b
index num
0 1 1.0 3.0
2 2.0 4.0
1 1 2.0 4.0
2 3.0 5.0
2 1 3.0 5.0
2 4.0 6.0
To get rid of the multiindex at the end: result.reset_index(drop=True)
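Since the question mentions names up to a50 and b50, the step that separates the letter from the number has to cope with more than one digit. A possible sketch of that step with str.extract, assuming names are letters followed by digits (the sample names and the parts variable are invented for illustration):

```python
import pandas as pd

# Example column names, including two-digit suffixes
# (assumed naming scheme: letters followed by digits).
names = pd.Series(["a1", "a2", "b1", "b2", "a50", "b50"])

# One capture group for the letter prefix, one for the whole number.
parts = names.str.extract(r"^([A-Za-z]+)(\d+)$")
parts.columns = ["col", "num"]
parts["num"] = parts["num"].astype(int)
print(parts)
```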

Related

I can't find the min value (which is > 0) in each row in selected columns df[df[col]>0]

This is my data, and I want to find the min value of the selected columns (a, b, c, d) in each row, then calculate the difference between that and dd. I need to ignore 0 in rows; I mean in the first row I need to find 8.
"need to ignore 0 in rows"
Then just replace it with NaN. Consider the following simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A":[1,2,0],"B":[3,5,7],"C":[7,0,7]})
df.replace(0,np.nan).apply(min)
df["minvalue"] = df.replace(0,np.nan).apply("min",axis=1)
print(df)
gives output
A B C minvalue
0 1 3 7 1.0
1 2 5 0 2.0
2 0 7 7 7.0
You can use DataFrame.apply with axis=1: take the columns ['a','b','c','d'] of each row as a Series, replace 0 with +inf, and find the min. Finally, subtract column 'dd' from that min.
import numpy as np
import pandas as pd
df['min_dd'] = df.apply(lambda row: min(pd.Series(row[['a','b','c','d']]).replace(0,np.inf)) - row['dd'], axis=1)
print(df)
a b c d dd min_dd
0 0 15 0 8 6 2.0 # min_without_zero : 8 , dd : 6 -> 8-6=2
1 2 0 5 3 2 0.0 # min_without_zero : 2 , dd : 2 -> 2-2=0
2 5 3 3 0 2 1.0 # 3 - 2
3 0 2 3 4 2 0.0 # 2 - 2
You can try
cols = ['a','b','c','d']
df['res'] = df[cols][df[cols].ne(0)].min(axis=1) - df['dd']
print(df)
a b c d dd res
0 0 15 0 8 6 2.0
1 2 0 5 3 2 0.0
2 5 3 3 0 2 1.0
3 2 3 4 4 2 0.0
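For reference, here is a self-contained sketch of the same idea that runs on its own. The frame is rebuilt from the first printed output above, and the column names min_pos and diff are invented for the example. It keeps only values strictly greater than 0, as the title asks, rather than only dropping exact zeros.

```python
import pandas as pd

# Data reconstructed from the printed example above.
df = pd.DataFrame({"a": [0, 2, 5, 0],
                   "b": [15, 0, 3, 2],
                   "c": [0, 5, 3, 3],
                   "d": [8, 3, 0, 4],
                   "dd": [6, 2, 2, 2]})

cols = ["a", "b", "c", "d"]

# where() keeps values that satisfy the condition and puts NaN elsewhere;
# min(axis=1) then skips the NaN cells by default.
positive = df[cols].where(df[cols] > 0)
df["min_pos"] = positive.min(axis=1)
df["diff"] = df["min_pos"] - df["dd"]
print(df)
```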

Keep all cells above given value in pandas DataFrame

I would like to discard all cells that contain a value below a given value. So not just whole rows or whole columns, but every individual cell.
I tried the code below, where all values in each cell should be at least 3. It doesn't work.
df[(df >= 3).any(axis=1)]
Example
import pandas as pd
my_dict = {'A':[1,5,6,2],'B':[9,9,1,2],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 2 2 5
I want to keep only the cells that are at least 3.
If you want "all values in each cell should be at least 3"
df [df < 3] = 3
df
A B C
0 3 9 3
1 5 9 3
2 6 3 3
3 3 3 5
If you want "to keep only the cells that are at least 3"
df = df [df >= 3]
df
A B C
0 NaN 9.0 NaN
1 5.0 9.0 NaN
2 6.0 NaN 3.0
3 NaN NaN 5.0
You can check if the value is >= 3 then drop all rows with NaN value.
df[df >= 3 ].dropna()
DEMO:
import pandas as pd
my_dict = {'A':[1,5,6,3],'B':[9,9,1,3],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 3 3 5
df = df[df >= 3 ].dropna().reset_index(drop=True)
df
A B C
0 3.0 3.0 5.0
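To see why the attempt in the question keeps too much, it may help to compare any and all on the boolean mask. This is a small sketch using the DEMO data; rows_any and rows_all are names made up here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 6, 3], "B": [9, 9, 1, 3], "C": [1, 1, 3, 5]})

mask = df >= 3

# any(axis=1): a row is kept if at least one cell passes the test.
rows_any = df[mask.any(axis=1)]

# all(axis=1): a row is kept only if every cell passes the test.
rows_all = df[mask.all(axis=1)]

print(rows_any)
print(rows_all)
```

Selecting rows with a boolean Series never introduces NaN, so the integer dtype is preserved, unlike df[df >= 3].dropna(), which returns floats.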

append two data frames with unequal columns

I am trying to append two dataframes in pandas which have different numbers of columns.
Example:
df1
A B
1 1
2 2
3 3
df2
A
4
5
Expected concatenated dataframe
df
A B
1 1
2 2
3 3
4 Null(or)0
5 Null(or)0
I am using
df1.append(df2) when the columns are the same, but I have no idea how to deal with an unequal number of columns.
How about pd.concat?
>>> pd.concat([df1,df2])
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
Also, df1.append(df2) still works (in pandas versions before 2.0; DataFrame.append was removed in 2.0, so use pd.concat there):
>>> df1.append(df2)
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
From the docs of df.append:
Columns not in this frame are added as new columns.
Use concat to join the two dataframes and pass the additional argument ignore_index=True to reset the index; otherwise you might end up with the index 0 1 2 0 1:
df1 = pd.DataFrame({'A':[1,2,3], 'B':[1,2,3]})
df2 = pd.DataFrame({'A':[4,5]})
df = pd.concat([df1,df2],ignore_index=True)
df
Output:
without ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
with ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 4 NaN
4 5 NaN
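The question also allows 0 instead of Null for the missing cells. A minimal sketch of that variant, assuming 0 is an acceptable placeholder for column B (combined is a name chosen for the example):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]})
df2 = pd.DataFrame({"A": [4, 5]})

# Rows coming from df2 have no B value, so concat fills them with NaN
# and B becomes float.
combined = pd.concat([df1, df2], ignore_index=True)

# Replace the missing values with 0 and restore the integer dtype.
combined["B"] = combined["B"].fillna(0).astype(int)
print(combined)
```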

Combine data from two columns into one, except if second is already occupied in pandas

Say I have two columns in a data frame, one of which is incomplete.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b':[5, '', 6, '']})
df
Out:
a b
0 1 5
1 2
2 3 6
3 4
Is there a way to fill the empty values in column b with the corresponding values in column a, whilst leaving the rest of column b intact, without iterating over the column, such that you obtain:
df
Out:
a b
0 1 5
1 2 2
2 3 6
3 4 4
I think you can use the apply method, but I am not sure. For reference, the dataset I'm dealing with is quite large (approx. 1 GB), which is why iteration, my first attempt, was not a good idea.
If blanks are empty strings, you could do:
In [165]: df.loc[df['b'] == '', 'b'] = df['a']
In [166]: df
Out[166]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
However, if your blanks are NaNs, you could use fillna
In [176]: df
Out[176]:
a b
0 1 5.0
1 2 NaN
2 3 6.0
3 4 NaN
In [177]: df['b'] = df['b'].fillna(df['a'])
In [178]: df
Out[178]:
a b
0 1 5.0
1 2 2.0
2 3 6.0
3 4 4.0
You can use np.where to evaluate df.b, if it's not empty keep its value, otherwise use df.a instead.
df.b=np.where(df.b,df.b,df.a)
df
Out[33]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use pd.Series.where with a boolean version of df.b, because '' resolves to False:
df.assign(b=df.b.where(df.b.astype(bool), df.a))
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use replace and ffill with axis=1:
df.replace('',np.nan).ffill(axis=1).astype(df.a.dtypes)
Output:
a b
0 1 5
1 2 2
2 3 6
3 4 4
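If a column might contain both kinds of blank, empty strings and NaN, one possible sketch is to build a single boolean mask and use Series.mask. The mixed sample data and the name blank are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Column b mixes a real value, an empty string and a NaN (assumed data).
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, "", 6, np.nan]})

# True wherever b is missing, either as NaN or as an empty string.
blank = df["b"].isna() | df["b"].eq("")

# mask() replaces the entries where the condition is True
# with the aligned values from column a.
df["b"] = df["b"].mask(blank, df["a"])
print(df)
```

This stays vectorised, so it avoids the row-by-row iteration the question wants to get away from.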

Loop over groups Pandas Dataframe and get sum/count

I am using Pandas to structure and process Data.
This is my DataFrame:
And this is the code which enabled me to get this DataFrame:
(data[['time_bucket', 'beginning_time', 'bitrate', 2, 3]].groupby(['time_bucket', 'beginning_time', 2, 3])).aggregate(np.mean)
Now I want the sum (ideally, the sum and the count) of my 'bitrate' values grouped by the same time_bucket. For example, for the first time_bucket (2016-07-08 02:00:00, 2016-07-08 02:05:00), it must be 93750000 as the sum and 25 as the count, over all the 'bitrate' values.
I did this :
data[['time_bucket', 'bitrate']].groupby(['time_bucket']).agg(['sum', 'count'])
And this is the result :
But I really want to have all my data in one DataFrame.
Can I do a simple loop over 'time_bucket' and apply a function which calculates the sum of all bitrates?
Any ideas ? Thx !
I think you need merge, but both DataFrames need the same index levels, so use reset_index. Finally, restore the original MultiIndex with set_index:
data = pd.DataFrame({'A':[1,1,1,1,1,1],
'B':[4,4,4,5,5,5],
'C':[3,3,3,1,1,1],
'D':[1,3,1,3,1,3],
'E':[5,3,6,5,7,1]})
print (data)
A B C D E
0 1 4 3 1 5
1 1 4 3 3 3
2 1 4 3 1 6
3 1 5 1 3 5
4 1 5 1 1 7
5 1 5 1 3 1
df1 = data[['A', 'B', 'C', 'D','E']].groupby(['A', 'B', 'C', 'D']).aggregate(np.mean)
print (df1)
E
A B C D
1 4 3 1 5.5
3 3.0
5 1 1 7.0
3 3.0
df2 = data[['A', 'C']].groupby(['A'])['C'].agg(['sum', 'count'])
print (df2)
sum count
A
1 12 6
print (pd.merge(df1.reset_index(['B','C','D']), df2, left_index=True, right_index=True)
.set_index(['B','C','D'], append=True))
E sum count
A B C D
1 4 3 1 5.5 12 6
3 3.0 12 6
5 1 1 7.0 12 6
3 3.0 12 6
I tried another solution that computes the output from df1, but df1 is already aggregated, so it is impossible to get the right data. If you sum level C, you get 8 instead of 12.
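If the per-group totals are wanted next to the original, unaggregated rows, groupby with transform is another option that needs neither a loop nor a merge. Here is a sketch on the sample data above, with the hypothetical column names C_sum and C_count:

```python
import pandas as pd

data = pd.DataFrame({"A": [1, 1, 1, 1, 1, 1],
                     "B": [4, 4, 4, 5, 5, 5],
                     "C": [3, 3, 3, 1, 1, 1],
                     "D": [1, 3, 1, 3, 1, 3],
                     "E": [5, 3, 6, 5, 7, 1]})

grouped = data.groupby("A")["C"]

# transform() returns a result aligned with the original rows, so every
# row receives the sum and count of its own group.
data["C_sum"] = grouped.transform("sum")
data["C_count"] = grouped.transform("count")
print(data)
```

Because the totals are computed from the raw rows, they come out as 12 and 6, not the 8 obtained from the already aggregated frame.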
