merge items into a list based on index - python

Suppose I have a dataframe as follows:
df = pd.DataFrame(range(4), index=range(4))
df = df.append(df)
the resultant df is:
0 0
1 1
2 2
3 3
0 0
1 1
2 2
3 3
I want to combine the values of the same index into a list. The desired result is:
0 [0,0]
1 [1,1]
2 [2,2]
3 [3,3]
For a more realistic scenario, my index will be dates, and I want to aggregate multiple obs into a list based on the date. In this way, I can perform some functions on the obs for each date.

If that's your goal, then I don't think you want to actually materialize a list. What you want to do is use groupby and then act on the groups. For example:
>>> df.groupby(level=0)
<pandas.core.groupby.DataFrameGroupBy object at 0xa861f6c>
>>> df.groupby(level=0)[0]
<pandas.core.groupby.SeriesGroupBy object at 0xa86630c>
>>> df.groupby(level=0)[0].sum()
0 0
1 2
2 4
3 6
Name: 0, dtype: int64
You could extract a list too:
>>> df.groupby(level=0)[0].apply(list)
0 [0, 0]
1 [1, 1]
2 [2, 2]
3 [3, 3]
Name: 0, dtype: object
but it's usually better to act on the groups themselves. Series and DataFrames aren't really meant for storing lists of objects.
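For instance, in the dates scenario you describe, you can aggregate per date without ever materializing lists. A minimal sketch, assuming a DatetimeIndex and a hypothetical obs column:
import pandas as pd

df = pd.DataFrame(
    {'obs': [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02']),
)
# one aggregated value per date, no intermediate lists
print(df.groupby(level=0)['obs'].mean())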

In [374]:
import pandas as pd
df = pd.DataFrame({'a':range(4)})
df = df.append(df)
df
Out[374]:
a
0 0
1 1
2 2
3 3
0 0
1 1
2 2
3 3
[8 rows x 1 columns]
In [379]:
import numpy as np
# loop over the index values and flatten each selection using numpy.ravel, casting to a list
for index in df.index.values:
    # use loc to select the values at that index
    print(index, list(np.ravel(df.loc[index].values)))
    # stop once we reach the max value of the index, otherwise we would output the values twice
    if index == max(df.index.values):
        break
0 [0, 0]
1 [1, 1]
2 [2, 2]
3 [3, 3]

Related

Efficient way in Pandas to count occurrences of Series of values by row

I have a large dataframe for which I want to count the number of occurrences of a series of specific values (given by an external function) by row. For reproducibility, let's assume the following simplified dataframe:
data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
A B C D E
0 3 4 1 1 4
1 2 3 2 1 4
2 1 2 3 2 4
3 0 1 4 2 4
How can I count the number of occurrences of specific values (given by a series of the same size) by row?
Again for simplicity, let's assume this values_series is given by the max of each row.
values_series = df.max(axis=1)
0 4
1 4
2 4
3 4
dtype: int64
The solution I got to seems not very pythonic (e.g. I'm using iterrows(), which is slow):
max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)
Is there any more efficient way to do this?
We can compare the transposed df.T directly to the df.max series, since pandas broadcasts the comparison by aligning the series index with the transposed frame's columns:
(df.T == df.max(axis=1)).sum()
# result
0 2
1 1
2 1
3 2
dtype: int64
(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
You can try
df.eq(df.max(1), axis=0).sum(1)
Out[361]:
0 2
1 1
2 1
3 2
dtype: int64
The perfect job for numpy broadcasting:
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
(a == b).sum(axis=1)
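# expected result for the sample data above: array([2, 1, 1, 2])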

How to obtain the number (count) of new indexes affected when a new column is considered

I have the following problem to solve.
I have a huge dataframe (14k rows x 1600 columns) consisting of 1s and 0s. I need to obtain the count of new unique values each time a new column is considered. That is, starting from the index column and the first column, if I then consider the second column, I need to obtain the count of how many of its rows differ from the first column. Then I consider the third column and obtain the count of the values that differ from those in the 1st and 2nd columns, and so on. For example, take the following dataset:
import pandas as pd
data = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 1]]
df = pd.DataFrame(data, columns=["S1", "S2", "S3"])
df
(1 means presence, 0 means absence; that is, in column S1 the index 0 was 'observed', while in column S2 it is 0, meaning it was not observed, and so on.)
Because I'm not sure how to write the code, I don't know whether it is easier to add a new row at the end with the count of the new values, or to transpose the df and obtain a new column with those values. In any case, the output I expect should be something like this:
import pandas as pd
data = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 1], [3, 1, 1]]
df_out = pd.DataFrame(data, columns=["S1", "S2", "S3"])
df_out
Here you can see that with just column S1 there are 3 observed indexes; when we consider columns S1 and S2, there are 2 repeated values but 1 new one; and when we add the third column there is just 1 new value compared with S1 and S2.
To clarify further: for this example, I need to count the total of 1s present in column 1; then, when I consider column 2, I need the count of cases [0, 1]; when I consider a third column, the count of cases [0, 0, 1]; for a fourth column, the cases [0, 0, 0, 1]; and so on.
At this link you can download a small section of the original df, with the total of unique 1s at the end (obtained manually).
I need to obtain that kind of output for the entire dataframe.
Hope someone can help.
Thanks!!!
You can either use @Corralien's solution with a bit of pre-processing (filtering out the all-zero rows so they are not counted):
df[~df.sum(axis=1).eq(0)].idxmax(axis=1).value_counts()
or, alternatively
df.cumsum(axis=1).cumsum(axis=1).eq(1).sum()
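To see why the double cumsum works: after two row-wise cumulative sums, a cell equals 1 exactly at the first 1 in its row, so the column sums count first appearances. A minimal sketch on a single row:
import pandas as pd

row = pd.DataFrame([[0, 1, 1]], columns=["S1", "S2", "S3"])
# first cumsum -> [0, 1, 2]; second cumsum -> [0, 1, 3]
print(row.cumsum(axis=1).cumsum(axis=1).eq(1))  # True only at S2, the first 1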
old answer
You can compute the difference with the shifted dataframe and sum:
df2 = ((df-df.shift(axis=1)).eq(1))
df2['S1'] = df['S1']
df.append(df2.sum(), ignore_index=True)
output:
S1 S2 S3
0 1 1 0
1 1 0 0
2 0 1 1
3 1 1 1
4 0 0 1
5 3 1 1
How it works:
>>> (df-df.shift(axis=1))
S1 S2 S3
0 NaN 0.0 -1.0
1 NaN -1.0 0.0
2 NaN 1.0 0.0
3 NaN 0.0 0.0
4 NaN 0.0 1.0
>>> (df-df.shift(axis=1)).eq(1)
S1 S2 S3
0 False False False
1 False False False
2 False True False
3 False False False
4 False False True
>>> df2['S1'] = df['S1']
>>> df2
S1 S2 S3
0 1 False False
1 1 False False
2 0 True False
3 1 False False
4 0 False True
>>> df2.sum()
S1 3
S2 1
S3 1
dtype: int64
In fact, per your clarification, you want to count where a 1 appears for the first time in each row:
>>> df[~df.eq(0).all(axis=1)].idxmax(axis=1).value_counts()
S1 151
S2 148
S3 113
dtype: int64
>>> df.append(df[~df.eq(0).all(axis=1)].idxmax(axis=1).value_counts(), ignore_index=True)
S1 S2 S3
0 1 1 1
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
... ... ... ...
14338 0 0 0
14339 0 0 0
14340 0 0 0
14341 0 0 0
14342 151 148 113
[14343 rows x 3 columns]

How to find all the zero cells in a python pandas dataframe and replace them?

My data is like this:
df = pd.DataFrame({'a': [5,0,0, 6, 0, 0, 0 , 12]})
I want to count the zeros above the 6 and replace them with 6/(count+1) = 6/3 = 2 (I will also replace the original 6).
I also want to do a similar thing with the zeros above the 12.
So, 12/(count+1) = 12/4 = 3.
So the final result will be:
[5, 2, 2, 2, 3, 3, 3, 3]
I am not sure how to start. Are there any functions that do this?
Thanks.
Use GroupBy.transform with 'mean' and custom groups, created by testing for values not equal to 0, reversing the order, taking the cumulative sum, and reversing back to the original order:
g = df['a'].ne(0).iloc[::-1].cumsum().iloc[::-1]
df['b'] = df.groupby(g)['a'].transform('mean')
print (df)
a b
0 5 5
1 0 2
2 0 2
3 6 2
4 0 3
5 0 3
6 0 3
7 12 3
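To see why the grouper works, here is g for this example (a sketch, with the intermediate values worked out by hand):
# df['a'].ne(0)        -> [True, False, False, True, False, False, False, True]
# reversed + cumsum    -> counting from the bottom, each nonzero value starts a new group
# reversed back        -> g = [3, 2, 2, 2, 1, 1, 1, 1]
# so group 2 is {0, 0, 6} with mean 2, and group 1 is {0, 0, 0, 12} with mean 3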

Pandas: assign a list to all rows of a multi-index dataframe

I have a multi index dataframe, let's say
index = [['a', 'a', 'b', 'b'],[1, 2, 1, 2]]
df = pd.DataFrame([1,2,3,4], index=index)
0
a 1 1
2 2
b 1 3
2 4
If I want to add a new column with a constant value, I can just do
df['new_col'] = 'IamNew'
0 new_col
a 1 1 IamNew
2 2 IamNew
b 1 3 IamNew
2 4 IamNew
Perfect.
However, what if I want to add a new column with a list? This doesn't work
df['new_col']=[1,2]
ValueError: Length of values does not match length of index
I have tried many options and spent quite some time trying to figure this out.
Any idea?
First, I think working with lists in pandas is not a good idea, but it is possible:
df['new_col']=pd.Series([[1,2]] * len(df), index=df.index)
print (df)
0 new_col
a 1 1 [1, 2]
2 2 [1, 2]
b 1 3 [1, 2]
2 4 [1, 2]
Another solution:
df['new_col'] = [[1, 2]] * len(df)
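One caveat with this second form: [[1, 2]] * len(df) repeats references to the same inner list, so mutating the list in one row mutates it in every row. A small sketch of the pitfall and a safer alternative:
df['new_col'] = [[1, 2]] * len(df)
df['new_col'].iloc[0].append(3)   # every row now shows [1, 2, 3]
df['new_col'] = [[1, 2] for _ in range(len(df))]   # independent copy per row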

rowbind elements of list into pandas data frame by grouping

I'm wondering what is the pythonic way of achieving the following:
Given a list of list:
l = [[1, 2],[3, 4],[5, 6],[7, 8]]
I would like to create a list of pandas data frames where the first pandas data frame is a row bind of the first two elements in l and the second a row bind of the last two elements:
>>> df1 = pd.DataFrame(np.asarray(l[:2]))
>>> df1
0 1
0 1 2
1 3 4
and
>>> df2 = pd.DataFrame(np.asarray(l[2:]))
>>> df2
0 1
0 5 6
1 7 8
In my problem I have a very long list and I know the grouping, i.e. the first k elements of the list l should be rowbinded to form the first df. How can this be achieved in a python friendly way?
You could store them in a dict, like:
In [586]: s = pd.Series(l)
In [587]: k = 2
In [588]: df = {key: pd.DataFrame(g.values.tolist()) for key, g in s.groupby(s.index // k)}
In [589]: df[0]
Out[589]:
0 1
0 1 2
1 3 4
In [590]: df[1]
Out[590]:
0 1
0 5 6
1 7 8
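If you don't need the pandas grouping machinery, a plain slicing sketch achieves the same chunking (dfs here is a hypothetical list name):
import numpy as np
import pandas as pd

l = [[1, 2], [3, 4], [5, 6], [7, 8]]
k = 2
# slice l into consecutive chunks of k rows and build one frame per chunk
dfs = [pd.DataFrame(np.asarray(l[i:i + k])) for i in range(0, len(l), k)]
print(dfs[0])   # rows [1, 2] and [3, 4]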
