Pandas Series with different lengths


Using pandas concat function it is possible to create a series like this:
In [230]: pd.concat({'One': pd.Series(range(3)), 'Two': pd.Series(range(4))})
Out[230]:
One  0    0
     1    1
     2    2
Two  0    0
     1    1
     2    2
     3    3
dtype: int64
Is it possible to do the same without using concat method?
My best approach was:
a = pd.Series(range(3),range(3))
b = pd.Series(range(4),range(4))
pd.Series([a,b],index=['One','Two'])
But the result is not the same. It is a Series that holds two Series objects:
One    0    0
       1    1
       2    2
       dtype: int64
Two    0    0
       1    1
       2    2
       3    3
       dtype: int64
dtype: object

This should give you an idea of just how useful concat is.
a.index = pd.MultiIndex.from_tuples([('One', v) for v in a.index])
b.index = pd.MultiIndex.from_tuples([('Two', v) for v in b.index])
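a.append(b)
One  0    0
     1    1
     2    2
Two  0    0
     1    1
     2    2
     3    3
dtype: int64
The same thing is achieved by pd.concat([a, b]). Note that Series.append was removed in pandas 2.0, so on recent versions pd.concat([a, b]) is the one to use.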

This is what the keys argument is for, in case you want to get the same output using concat, i.e.:
pd.concat([a, b], keys=['One', 'Two'])
One  0    0
     1    1
     2    2
Two  0    0
     1    1
     2    2
     3    3
dtype: int64

This works fine (on pandas 0.24 and later the labels argument is called codes):
data = list(range(3)) + list(range(4))
index = pd.MultiIndex(levels=[['One', 'Two'], [0, 1, 2, 3]],
                      codes=[[0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 3]])
pd.Series(data, index=index)
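As a rough sketch of the same idea without spelling out levels and codes by hand, the index can be built from (key, label) tuples. The names parts, tuples, values, manual and via_concat are made up for this illustration, and the comparison at the end is only there to check the result against concat.

```python
import pandas as pd

# Hypothetical input: the two Series from the question, keyed by their outer label.
parts = {"One": pd.Series(range(3)), "Two": pd.Series(range(4))}

# Pair every inner label with its outer key, and flatten the values in the same order.
tuples = [(key, label) for key, s in parts.items() for label in s.index]
values = [value for s in parts.values() for value in s]

# Build the two-level Series directly, without calling concat.
manual = pd.Series(values, index=pd.MultiIndex.from_tuples(tuples))

# Reference result, used only to compare against the manual construction.
via_concat = pd.concat(parts)

print(manual)
print(list(manual.index) == list(via_concat.index))
```

MultiIndex.from_tuples works out the levels and codes itself, which avoids typing the integer codes manually.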

Related

How does nunique work with given table values?

yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
output:
A B
0 1 1
1 2 1
2 3 1
yf.nunique(axis=0)
output:
A 3
B 1
yf.nunique(axis=1)
output:
0 1
1 2
2 2
Could you please explain how axis=0 and axis=1 work? With axis=0, why are A=2 and B=1 ignored? I also wonder whether nunique takes the index into account.

DataFrame.nunique counts the number of unique values per column (axis=0) or per row (axis=1). The index is not included in the count.
yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
print (yf)
A B
0 1 1
1 2 1
2 3 1
print (yf.nunique(axis=0))
A 3
B 1
dtype: int64
print (yf.nunique(axis=1))
0 1
1 2
2 2
dtype: int64
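It means:
With axis=0, A is 3 because column A holds 3 unique values (1, 2, 3), and B is 1 because column B holds only the value 1.
With axis=1, 0 is 1 because row 0 holds 1 unique value (1, 1), while rows 1 and 2 each hold 2 unique values.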
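A small sketch that recomputes both results by hand may make the axis argument easier to see. The names per_column, per_row, manual_cols and manual_rows are invented here, and the set-based version is only an illustration, not how pandas implements nunique.

```python
import pandas as pd

yf = pd.DataFrame({"A": [1, 2, 3], "B": [1, 1, 1]})

# axis=0: walk down each column and count distinct values.
per_column = yf.nunique(axis=0)

# axis=1: walk across each row and count distinct values.
per_row = yf.nunique(axis=1)

# The same counts computed by hand with Python sets.
manual_cols = {col: len(set(yf[col])) for col in yf.columns}
manual_rows = [len(set(row)) for row in yf.to_numpy()]

print(per_column.to_dict(), manual_cols)
print(per_row.tolist(), manual_rows)
```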

Count number of occurrences in Pandas with a specified list

I have a list of possible integer numbers:
item_list = [0,1,2,3]
and some of these numbers will not necessarily appear in my dataframe. For example, with:
df = pd.DataFrame({'a': [0, 2, 0, 1, 0, 1, 0, 0]})
executing
df['a'].value_counts()
will yield
0 5
1 2
2 1
Name: a, dtype: int64
but I am interested in the counts for every entry of item_list = [0,1,2,3], including those that never occur. Basically, I would like to see something like:
0 5
1 2
2 1
3 0
Name: a, dtype: int64
where the first column is 'item_list'
How to get this result?

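You can also use reindex. The NaN introduced for missing items turns the counts into floats, so cast back to int after filling:
df['a'].value_counts().reindex(item_list).fillna(0).astype(int)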

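You can convert values to Categorical (recent pandas versions take the categories through a CategoricalDtype rather than as an astype keyword):
item_list = [0,1,2,3]
df['a'] = df['a'].astype(pd.CategoricalDtype(categories=item_list))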
print (df['a'].value_counts())
0 5
1 2
2 1
3 0
Name: a, dtype: int64
With reindex and parameter fill_value:
print (df['a'].value_counts().reindex(item_list, fill_value=0))
0 5
1 2
2 1
3 0
Name: a, dtype: int64
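For illustration, here is a self-contained sketch of both approaches on a small made-up sample. The helper count_items and the variable sample are hypothetical names, not part of pandas.

```python
import pandas as pd

item_list = [0, 1, 2, 3]
sample = pd.Series([2, 0, 0, 1], name="a")  # made-up data; 3 never occurs


def count_items(series, items):
    """Hypothetical helper: count occurrences of every entry in `items`, 0 if absent."""
    return series.value_counts().reindex(items, fill_value=0)


counts = count_items(sample, item_list)

# Categorical route: value_counts reports every declared category, even unused ones.
cat_counts = (
    pd.Series(pd.Categorical(sample, categories=item_list))
    .value_counts()
    .sort_index()
)

print(counts.tolist())
print(cat_counts.tolist())
```

Passing fill_value=0 to reindex keeps the counts as integers, so no cast is needed afterwards.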

Counting the number of missing/NaN in each row

I've got a dataset with a big number of rows. Some of the values are NaN, like this:
In [91]: df
Out[91]:
1 3 1 1 1
1 3 1 1 1
2 3 1 1 1
1 1 NaN NaN NaN
1 3 1 1 1
1 1 1 1 1
I want to count the number of NaN values in each row. The result would look like this:
In [91]: list = <somecode with df>
In [92]: list
Out[92]:
[0,
0,
0,
3,
0,
0]
What is the best and fastest way to do it?

You can first flag which elements are NaN with isnull() and then take the row-wise sum with sum(axis=1):
In [195]: df.isnull().sum(axis=1)
Out[195]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
And if you want the output as a list, call tolist() on the result:
In [196]: df.isnull().sum(axis=1).tolist()
Out[196]: [0, 0, 0, 3, 0, 0]
Or subtract the per-row count of non-NaN values from the number of columns:
In [130]: df.shape[1] - df.count(axis=1)
Out[130]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64

To count NaNs per row across specific columns only, use
cols = ['col1', 'col2']
df['number_of_NaNs'] = df[cols].isna().sum(1)
or index the columns by position, e.g. count NaNs in the first 4 columns:
df['number_of_NaNs'] = df.iloc[:, :4].isna().sum(1)
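The variants above can be compared side by side in a short sketch. The frame below is made-up sample data, and the column names col1 to col3 are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up sample: row 0 has no NaN, rows 1 and 2 have two each.
df = pd.DataFrame({
    "col1": [1, 1, np.nan],
    "col2": [3, np.nan, np.nan],
    "col3": [1, np.nan, 1],
})

# Flag NaNs, then sum the flags across each row.
via_isna = df.isna().sum(axis=1)

# Number of columns minus the non-NaN count per row.
via_count = df.shape[1] - df.count(axis=1)

# Restrict the count to a subset of columns.
subset = df[["col1", "col2"]].isna().sum(axis=1)

print(via_isna.tolist(), via_count.tolist(), subset.tolist())
```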

Pandas indexing by both boolean `loc` and subsequent `iloc`

I want to index a Pandas dataframe using a boolean mask, then set a value in a subset of the filtered dataframe based on an integer index, and have this value reflected in the dataframe. That is, I would be happy if this worked on a view of the dataframe.
Example:
In [293]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
'b': [5, 5, 2, 2, 5, 5, 2, 2],
'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)
df.loc[mask, 'c']
Out[293]:
2 0
3 0
6 0
Name: c, dtype: int64
Now I would like to set the values of the first two elements returned in the filtered dataframe. Chaining an iloc onto the loc call above works to index:
In [294]:
df.loc[mask, 'c'].iloc[0: 2]
Out[294]:
2 0
3 0
Name: c, dtype: int64
But not to assign:
In [295]:
df.loc[mask, 'c'].iloc[0: 2] = 1
print(df)
a b c
0 0 5 0
1 1 5 0
2 2 2 0
3 3 2 0
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
Making the assigned value the same length as the slice (i.e. = [1, 1]) also doesn't work. Is there a way to assign these values?

This does work but is a little ugly. Basically, we take the index generated from the mask and make an additional call to loc:
In [57]:
df.loc[df.loc[mask,'c'].iloc[0:2].index, 'c'] = 1
df
Out[57]:
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
So breaking the above down:
In [60]:
# take the index from the mask and iloc
df.loc[mask, 'c'].iloc[0: 2]
Out[60]:
2 0
3 0
Name: c, dtype: int64
In [61]:
# call loc using this index, we can now use this to select column 'c' and set the value
df.loc[df.loc[mask,'c'].iloc[0:2].index]
Out[61]:
a b c
2 2 2 0
3 3 2 0

How about:
ix = df.index[mask][:2]
df.loc[ix, 'c'] = 1
Same idea as EdChum but more elegant as suggested in the comment.
EDIT: Be a little careful with this one, as it may give unwanted results with a non-unique index: several rows can share either of the labels in ix above. If the index is non-unique and you only want the first 2 (or n) rows that satisfy the boolean key, it is safer to use .iloc with integer positions. Because .iloc does not accept column labels, the column position is looked up with get_loc:
ix = np.where(mask)[0][:2]
df.iloc[ix, df.columns.get_loc('c')] = 1
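To see the difference on a non-unique index, here is a small sketch with made-up data. The frame and the names by_label, by_pos and rows are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up frame with a non-unique index: the label 'x' sits on three rows.
df = pd.DataFrame(
    {"b": [2, 5, 2, 2], "c": [0, 0, 0, 0]},
    index=["x", "x", "y", "x"],
)
mask = (df["b"] == 2).to_numpy()  # True, False, True, True

# Label-based: the first two matching labels are 'x' and 'y',
# but loc then writes to every row carrying one of those labels.
by_label = df.copy()
by_label.loc[by_label.index[mask][:2], "c"] = 1

# Position-based: only the first two rows where the mask is True are written.
by_pos = df.copy()
rows = np.flatnonzero(mask)[:2]
by_pos.iloc[rows, by_pos.columns.get_loc("c")] = 1

print(by_label["c"].tolist())
print(by_pos["c"].tolist())
```

With a unique index the two approaches agree, so the positional form only matters when labels repeat.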

I don't know if this is any more elegant, but it's a little different:
mask = mask & (mask.cumsum() < 3)
df.loc[mask, 'c'] = 1
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
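The cumsum trick generalizes to the first n matches. A minimal sketch follows, with n and first_n as names made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, 1, 2, 3, 4, 5, 6, 7],
    "b": [5, 5, 2, 2, 5, 5, 2, 2],
    "c": [0, 0, 0, 0, 0, 0, 0, 0],
})
mask = (df["a"] < 7) & (df["b"] == 2)

n = 2  # how many matching rows to update (assumed parameter)

# cumsum numbers the True entries 1, 2, 3, ...; keep only those numbered up to n.
first_n = mask & (mask.cumsum() <= n)

# A single loc assignment on the original frame, so the change is kept.
df.loc[first_n, "c"] = 1

print(df["c"].tolist())
```

Since the narrowed mask stays aligned with the frame, this also behaves correctly when the index contains duplicate labels.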

How to repeat a Pandas DataFrame?

This is my DataFrame, which should be repeated 5 times:
>>> x = pd.DataFrame({'a':1,'b':2}, index = range(1))
>>> x
a b
0 1 2
I want to have the result like this:
>>> x.append(x).append(x).append(x)
a b
0 1 2
0 1 2
0 1 2
0 1 2
But there must be a smarter way than appending 4 times. The DataFrame I'm actually working on should be repeated 50 times.
I haven't found anything practical. Things like np.repeat just don't work on a DataFrame.
Could anyone help?

You can use the concat function:
In [13]: pd.concat([x]*5)
Out[13]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
If you only want to repeat the values and not the index, you can do:
In [14]: pd.concat([x]*5, ignore_index=True)
Out[14]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
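For the 50 repetitions mentioned in the question, the concat call could be wrapped as below. repeat_frame is a hypothetical helper written for this sketch, not a pandas function.

```python
import pandas as pd


def repeat_frame(frame, times, reset_index=True):
    """Hypothetical helper: stack `times` copies of `frame` on top of each other."""
    return pd.concat([frame] * times, ignore_index=reset_index)


x = pd.DataFrame({"a": 1, "b": 2}, index=range(1))

repeated = repeat_frame(x, 50)                       # fresh 0..49 index
kept_index = repeat_frame(x, 3, reset_index=False)   # original labels repeated

print(repeated.shape)
print(kept_index.index.tolist())
```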

I think it's cleaner/faster to use iloc nowadays:
In [11]: np.full(3, 0)
Out[11]: array([0, 0, 0])
In [12]: x.iloc[np.full(3, 0)]
Out[12]:
a b
0 1 2
0 1 2
0 1 2
More generally, you can use tile or repeat with arange:
In [21]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 4
In [23]: np.tile(np.arange(len(df)), 3)
Out[23]: array([0, 1, 0, 1, 0, 1])
In [24]: np.repeat(np.arange(len(df)), 3)
Out[24]: array([0, 0, 0, 1, 1, 1])
In [25]: df.iloc[np.tile(np.arange(len(df)), 3)]
Out[25]:
A B
0 1 2
1 3 4
0 1 2
1 3 4
0 1 2
1 3 4
In [26]: df.iloc[np.repeat(np.arange(len(df)), 3)]
Out[26]:
A B
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
Note: This will work with non-integer indexed DataFrames (and Series).
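As a quick check of that note, the sketch below applies the same calls to a frame with string labels. The labels r1 and r2 are made up for the example.

```python
import numpy as np
import pandas as pd

# Same data as above, but with a made-up string index.
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"], index=["r1", "r2"])

positions = np.arange(len(df))

# tile repeats the whole block: r1, r2, r1, r2, ...
tiled = df.iloc[np.tile(positions, 3)]

# repeat repeats each row in place: r1, r1, r1, r2, ...
repeated = df.iloc[np.repeat(positions, 3)]

print(tiled.index.tolist())
print(repeated.index.tolist())
```

iloc selects by position, so the label type of the index does not matter here.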

Try using numpy.repeat:
>>> import numpy as np
>>> df = pd.DataFrame(np.repeat(x.to_numpy(), 5, axis=0), columns=x.columns)
>>> df
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2

I would generally not repeat and/or append unless your problem really makes it necessary. It is highly inefficient and typically comes from not understanding the proper way to attack a problem.
I don't know your exact use case, but if you have your values stored as
values = np.array([1, 2])
then
df2 = pd.DataFrame(index=np.arange(0, 50), columns=['a', 'b'])
df2[['a', 'b']] = values
will do the job. Perhaps you could better explain what you're trying to achieve?

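Append should work too on pandas versions before 2.0, where DataFrame.append was removed. Note that the result has 6 rows: the original plus 5 appended copies.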
In [589]: x = pd.DataFrame({'a':1,'b':2},index = range(1))
In [590]: x
Out[590]:
a b
0 1 2
In [591]: x.append([x]*5, ignore_index=True) #Ignores the index as per your need
Out[591]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
5 1 2
In [592]: x.append([x]*5)
Out[592]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2

Without numpy, we could also use Index.repeat and loc (or reindex):
x.loc[x.index.repeat(5)].reset_index(drop=True)
or
x.reindex(x.index.repeat(5)).reset_index(drop=True)
Output:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
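Index.repeat also accepts one count per row, which may be handy when rows need different numbers of copies. This goes slightly beyond the question. The frame and the counts below are made up for the sketch.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3], "B": [2, 4]})

# Assumed per-row counts: keep the first row once, repeat the second three times.
counts = [1, 3]

# Repeat the index labels, select those rows, then renumber the index.
out = df.loc[df.index.repeat(counts)].reset_index(drop=True)

print(out["A"].tolist())
print(out["B"].tolist())
```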

Applying a lambda to each column is a universal approach in my opinion (with axis=0 the function receives one column at a time):
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
df.apply(lambda col: col.repeat(2), axis=0) #.reset_index()
Out[1]:
A B
0 1 2
0 1 2
1 3 4
1 3 4
