How to slice rows between first and last row in pandas dataframe? - python

How can I use loc to slice everything between the first and last rows of the following pandas DataFrame?
Input:
| id | text |
|----|------|
| 0  | A    |
| 1  | B    |
| 2  | C    |
| 3  | D    |
Output:
| id | text |
|----|------|
| 1 | B |
| 2 | C |

Selecting rows
To slice all rows by position between the first (0) and the last (-1), use
df.iloc[1:-1]  # similar to df.iloc[1:3] here

   id text
1   1    B
2   2    C
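As a minimal runnable sketch of the slicing above (frame reconstructed from the question):

```python
import pandas as pd

# Reconstruct the question's frame: an 'id' column plus the default integer index.
df = pd.DataFrame({'id': [0, 1, 2, 3], 'text': ['A', 'B', 'C', 'D']})

# iloc slices by position; 1:-1 drops the first and last rows.
middle = df.iloc[1:-1]
print(middle)
```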
Assigning to an existing column
Since iloc expects positional values, if you need to assign back, pass the column position as the second argument in this manner:
df.iloc[1:-1, df.columns.get_loc('text')] = 'Test'
df
id text
0 0 A
1 1 Test
2 2 Test
3 3 D
Assigning to a new column
Since your index values are numeric, you can simplify the options above (as a special case):
df.loc[1:2, 'text'] = 'Test' # note the slice is `1:2`, not `1:3` or `1:-1`
In the case of loc, the ending slice is inclusive. If you don't know at runtime what your index labels will be, you can generalise using:
# Generalized solution
df.loc[df.index[1]:df.index[-2], 'text'] = 'Test'
The solution above leaves NaN in the unassigned rows. To assign values to all rows, one option is to chain fillna onto the solution above, but we can do better with np.where or a more specific expression, depending on your problem.
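A sketch of the np.where alternative mentioned above, assuming you want to keep the original values (rather than NaN) in the first and last rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3], 'text': ['A', 'B', 'C', 'D']})

# Boolean mask: True everywhere except the first and last positions.
mask = np.zeros(len(df), dtype=bool)
mask[1:-1] = True

# np.where picks 'Test' where the mask holds and keeps the old value elsewhere.
df['text'] = np.where(mask, 'Test', df['text'])
```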

Related

How can I subset a dataframe and put the subsets in a list?

I'm looking for a more automated approach to subset this dataframe by rank and put the subsets in a list, because if there happen to be 150 ranks I can't create each subset individually.
| ID | GROUP | RANK |
|----|-------|------|
| 1  | A     | 1    |
| 2  | B     | 2    |
| 3  | C     | 3    |
| 2  | A     | 1    |
| 2  | E     | 2    |
| 2  | G     | 3    |
How can I subset the dataframe by Rank and then put every subset in a list? (Not using group by)
I know how to individually subset them but I'm not sure how I can do this if there's more ranks.
Output:
ranks = [df1,df2,df3....and so on]
Just use groupby directly in a list comprehension
>>> [df for rank, df in df.groupby('RANK')]
This will generate a list of dataframes, each a sub-dataframe related to the corresponding rank.
You can also do a dict comprehension:
>>> dic = {rank: df for rank, df in df.groupby('RANK')}
such that you can access your df via dic[1] for rank == 1.
In more detail, pd.DataFrame.groupby is a method that returns a DataFrameGroupBy object. A DataFrameGroupBy object is an iterable, which means you can iterate over it with a for loop. This iterable generates tuples with two values: the first is whatever you grouped by (in this case, an integer rank), and the second is the corresponding sub-dataframe.
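The comprehensions above as a small runnable example, using the RANK data from the question:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 2, 2, 2],
                   'GROUP': list('ABCAEG'),
                   'RANK': [1, 2, 3, 1, 2, 3]})

# Each iteration over the groupby yields a (key, sub-DataFrame) tuple.
ranks = [sub for rank, sub in df.groupby('RANK')]
dic = {rank: sub for rank, sub in df.groupby('RANK')}
```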

How to change the index of a series? Series don't have the set_index method

Unlike dataframes, series do not have the set_index method, which raises the question: if we want to change the index, how do we do it?
Assigning to series.index works, but I seem to remember it isn't recommended?
I could convert to a dataframe, use the set_index method, then convert back to a series, but that seems convoluted.
EDIT: I do not want to reindex, but to set a new index. As explained in the docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
By default values in the new index that do not have corresponding
records in the dataframe are assigned NaN.
That is NOT what I need. What I need is:
+---------------+-----------+
| Current index | New Index |
+---------------+-----------+
| 1 | 4 |
| 2 | 5 |
| 3 | 6 |
+---------------+-----------+
A toy example:
import pandas as pd
s = pd.Series(data =['a','b','c'])
#this works
s.index = [4,5,6]
# this doesn't work
s = s.set_index([4,5,6])
If you want to replace the index, I recommend simply assigning the new index list to s.index:
import pandas as pd

s = pd.Series(data=['a', 'b', 'c'], index=range(1, 4))
print(s)
# Series.append was removed in pandas 2.0; pd.concat is the replacement
s = pd.concat([s, pd.Series(list('defgh'))])
print(s)  # the appended part restarts its index at 0
s.index = range(11, 19)
print(s)
Here's the output:
Initial Series:
1 a
2 b
3 c
dtype: object
After the append:
1 a
2 b
3 c
0 d
1 e
2 f
3 g
4 h
dtype: object
As you can see, when we append without supplying an index, the appended values restart from 0. Now, to set the index to new values, I can just assign to s.index. After s.index = range(11, 19), the series is:
11 a
12 b
13 c
14 d
15 e
16 f
17 g
18 h
dtype: object
Let's try .set_axis as an alternative, since .set_index is a DataFrame method and not a Series method:
s.set_axis([4,5,6], axis=0)
You can also use np.arange (after import numpy as np) if you don't want to hard-code the values:
s.set_axis(np.arange(len(s))+4, axis=0)
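Putting the set_axis answer together as a runnable sketch (note that set_axis returns a new Series rather than modifying in place):

```python
import numpy as np
import pandas as pd

s = pd.Series(data=['a', 'b', 'c'])

# Hard-coded replacement index.
s2 = s.set_axis([4, 5, 6], axis=0)

# The same index computed with np.arange, no hard-coding.
s3 = s.set_axis(np.arange(len(s)) + 4, axis=0)
```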

Select either a column or another based on condition

I need to select values from one column (A) if they are not NaN, and otherwise from another column (B). How can I do this in Python?
The use case is that I'm inserting a new column C that will contain the result of an operation (func) on values from column A. Sometimes the values in A are nan, and in these cases I want to calculate the value in C from B, an example of the result would be this:
| A   | B    | C          |
|-----|------|------------|
| bla | bla2 | func(bla)  | # read from A
| nan | bla3 | func(bla3) | # read from B
What you want is combine_first:
df['C'] = df['A'].combine_first(df['B']).apply(func)
No explicit iteration required here...
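A runnable sketch of the combine_first approach, with a stand-in func (upper-casing, purely for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['bla', np.nan], 'B': ['bla2', 'bla3']})

def func(value):
    # Stand-in for whatever operation you actually need.
    return value.upper()

# combine_first fills NaNs in A with the value from B at the same index.
df['C'] = df['A'].combine_first(df['B']).apply(func)
```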
This code should also solve your problem:
df['C'] = [row['B'] if pd.isna(row['A']) else row['A'] for _, row in df.iterrows()]
Basically, you build a new list by iterating over all rows and selecting the correct value for each, then assign it to the df. Note that df.iterrows() yields (index, row) pairs, and pd.isna is the right NaN check (comparing against the string 'nan' would not work).
By applying a function to the dataframe that checks for nan before it returns, you have the ability to switch to another value in the same row. It's important to give axis=1 if you're using apply on a dataframe (all columns) rather than a series (one column)
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, -2], 'B': [1, 2, 2]})

def fn(x):
    c = np.sqrt(x['A'])  # NaN when x['A'] is negative
    if np.isnan(c):
        c = np.sqrt(x['B'])
    return c

df['C'] = df.apply(fn, axis=1)

Matching dictionaries with columns and indices in DataFrame | python

I have a DataFrame with the column names shown below and an index from 0 to 1000. The dataframe is filled with zeros.
|     | House 1 | House 2 | House 5 | House 8 | ... |
|-----|---------|---------|---------|---------|-----|
| 0   | 0       | 0       | 0       | 0       |     |
| 1   | 0       | 0       | 0       | 0       |     |
| 2   | 0       | 0       | 0       | 0       |     |
| ... |         |         |         |         |     |
Then I have a dictionary, e.g.:
dict_of_houses = {'House 1':[100,201,306,387,500,900],'House 2':[31,87,254,675,987],'House 5':[23,45,67,123,345,654,789,808,864,987,999],'House 8':[23,675,786,858,868,912,934]}
Dictionary name edited in order not to confuse anyone later.
My goal is to:
- for every dict key, match it with the column
- for every number in the list (the dictionary's value), match it with the index
- if both the index and the column match, change the cell to 1
- else: leave the zero
How would you do that?
You can use a for loop (note the dictionary is named dict_of_houses above):
for house, indices in dict_of_houses.items():
    df.loc[indices, house] = 1
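The loop above as a runnable sketch, with a zero-filled frame built the way the question describes (index 0 to 1000):

```python
import pandas as pd

dict_of_houses = {'House 1': [100, 201, 306, 387, 500, 900],
                  'House 2': [31, 87, 254, 675, 987]}

# Zero-filled frame with an index from 0 to 1000 and one column per house.
df = pd.DataFrame(0, index=range(1001), columns=list(dict_of_houses))

# loc accepts a list of index labels, so each column is set in one shot.
for house, indices in dict_of_houses.items():
    df.loc[indices, house] = 1
```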

Why accessing values in dataframe and list are different?

Suppose I have a list defined as a = [[1,2,3,4],[5,6,7,8]].
then a[0] returns the first element in the list: [1,2,3,4].
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]])
df is represented as
0 | 1 2 3 4
1 | 5 6 7 8
If I use df[0] the following value is printed:
0 | 1
1 | 5
I would have expected the first row ie 0| 1 2 3 4 to be printed like in a list. Is it because dataframe is represented by dataframe[cols][rows] rather than dataframe[rows][cols]?
df[x] accesses column(s) named x.
df.loc[y] access row(s) with index y.
This is an issue with syntax, not how data is stored internally by pandas.
You should read Indexing and Selecting Data to understand the various ways you can extract data from a pd.DataFrame object.
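A short demonstration of the column-vs-row distinction, using the frame from the question:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

col0 = df[0]      # the column labelled 0
row0 = df.loc[0]  # the row whose index label is 0
```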
