Obtaining the first few rows of a dataframe - python

Is there a way to get the first n rows of a dataframe without using the indices? For example, I know that if I have a dataframe called df I could get the first 5 rows via df.ix[:5]. But what if my indices are not ordered and I don't want to order them? Then this does not seem to work, so I was wondering if there is another way to select the first couple of rows. I apologize if there is already an answer to this; I wasn't able to find one.

Use head(5) or iloc[:5]. Both select rows by position rather than by index label, so an unordered index is not a problem. (Note that .ix is deprecated and was removed in pandas 1.0.)
In [7]:
df = pd.DataFrame(np.random.randn(10,3))
df
Out[7]:
0 1 2
0 -1.230919 1.482451 0.221723
1 -0.302693 -1.650244 0.957594
2 -0.656565 0.548343 1.383227
3 0.348090 -0.721904 -1.396192
4 0.849480 -0.431355 0.501644
5 0.030110 0.951908 -0.788161
6 2.104805 -0.302218 -0.660225
7 -0.657953 0.423303 1.408165
8 -1.940009 0.476254 -0.014590
9 -0.753064 -1.083119 -0.901708
In [8]:
df.head(5)
Out[8]:
0 1 2
0 -1.230919 1.482451 0.221723
1 -0.302693 -1.650244 0.957594
2 -0.656565 0.548343 1.383227
3 0.348090 -0.721904 -1.396192
4 0.849480 -0.431355 0.501644
In [11]:
df.iloc[:5]
Out[11]:
0 1 2
0 -1.230919 1.482451 0.221723
1 -0.302693 -1.650244 0.957594
2 -0.656565 0.548343 1.383227
3 0.348090 -0.721904 -1.396192
4 0.849480 -0.431355 0.501644
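Because head() and iloc select by position rather than by index label, they return the first rows even when the index is shuffled or non-numeric. A minimal sketch (the string index here is just for illustration):
import numpy as np
import pandas as pd

# A dataframe whose index is deliberately not in order
df = pd.DataFrame(np.random.randn(10, 3), index=list('jihgfedcba'))

# Both return the first 5 rows by position, ignoring the index labels
print(df.head(5))
print(df.iloc[:5])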


Pandas sampling a dataframe but treating multiple rows as a single row based on column

Consider the following toy code, which shows a simplified version of my actual problem:
import pandas
df = pandas.DataFrame(
    {
        'n_event': [1,2,3,4,5],
        'some column': [0,1,2,3,4],
    }
)
df = df.set_index(['n_event'])
print(df)
resampled_df = df.sample(frac=1, replace=True)
print(resampled_df)
The resampled_df is, as its name suggests, a resampled version of the original (with replacement). This is exactly what I want. An example output of the previous code is:
some column
n_event
1 0
2 1
3 2
4 3
5 4
some column
n_event
4 3
1 0
4 3
4 3
2 1
Now for my actual question I have the following dataframe:
import pandas
df = pandas.DataFrame(
    {
        'n_event': [1,1,2,2,3,3,4,4,5,5],
        'n_channel': [1,2,1,2,1,2,1,2,1,2],
        'some column': [0,1,2,3,4,5,6,7,8,9],
    }
)
df = df.set_index(['n_event','n_channel'])
print(df)
which looks like
some column
n_event n_channel
1 1 0
2 1
2 1 2
2 3
3 1 4
2 5
4 1 6
2 7
5 1 8
2 9
I want to do exactly the same as before, resample with replacements, but treating each group of rows with the same n_event as a single entity. A hand-built example of what I want to do can look like this:
some column
n_event n_channel
2 1 2
2 3
2 1 2
2 3
3 1 4
2 5
1 1 0
2 1
5 1 8
2 9
As seen, each n_event was treated as a whole and rows within each event were not mixed up.
How can I do this without proceeding by brute force (i.e. without for loops, etc)?
I have tried df.sample(frac=1, replace=True, ignore_index=False) and a few things using groupby, without success.
Would a pivot()/melt() sequence work for you?
Use pivot() to go from long to wide (make each group a single row).
Do the sampling.
Then go back from wide to long using melt().
I don't have time to work out a full answer, but I thought I would get this idea to you in case it helps.
Following the suggestion of jch I was able to find a solution by combining pivot and stack:
import pandas
df = pandas.DataFrame(
    {
        'n_event': [1,1,2,2,3,3,4,4,5,5],
        'n_channel': [1,2,1,2,1,2,1,2,1,2],
        'some column': [0,1,2,3,4,5,6,7,8,9],
        'other col': [5,6,4,3,2,5,2,6,8,7],
    }
)
resampled_df = df.pivot(
    index = 'n_event',
    columns = 'n_channel',
    values = list(set(df.columns) - {'n_event','n_channel'}),
)
resampled_df = resampled_df.sample(frac=1, replace=True)
resampled_df = resampled_df.stack()
print(resampled_df)
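For reference, a loop-free alternative that stays in long format is to sample the unique n_event labels with replacement and let .loc expand each label into its whole group. This is only a sketch: it assumes the MultiIndexed df built earlier in the question (with n_event as the first index level), and relies on .loc repeating a group when its label appears several times in the selection:
import numpy as np
# Sample whole events, then pull in every row of each sampled event
events = df.index.get_level_values('n_event').unique()
chosen = np.random.choice(events, size=len(events), replace=True)
resampled_df = df.loc[chosen]
print(resampled_df)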

How to perform groupby and remove duplicate based on first occurrence of a column condition?

This problem is a bit hard for me to wrap my head around so I hope I can explain it properly below.
I have a data frame with a lot of rows but only 3 columns like below:
import pandas as pd

data = {'line_group': [1,1,8,8,4,4,5,5],
        'route_order': [1,2,1,2,1,2,1,2],
        'StartEnd': ['20888->20850','20888->20850','20888->20850','20888->20850',
                     '20961->20960','20961->20960','20961->20960','20961->20960']}
df = pd.DataFrame(data)
In the end, I want to use this data to plot routes between points, for instance 20888 to 20850. But the problem is that a lot of trips/line_groups also go through these two points, so when I plot them everything overlaps and it becomes very slow, which is not what I want.
So I only want the first line_group for each unique StartEnd, like in the data frame below:
I believe it could have something to do with groupby, like in the code below that I have tried, but it doesn't produce the results I want. Also, in the full dataset route orders aren't just from one point to another and can be much longer (e.g. 1, 2, 3, 4, ...).
df.drop_duplicates(subset='StartEnd', keep="first")
Group by StartEnd and keep only the first line_group value
Then filter to rows which contain the unique line groups
unique_groups = df.groupby('StartEnd')['line_group'].agg(lambda x: list(x)[0]).reset_index()
StartEnd line_group
20888->20850 1
20961->20960 4
unique_line_groups = unique_groups['line_group']
filtered_df = df[df['line_group'].isin(unique_line_groups)]
Final Output
line_group route_order StartEnd
1 1 20888->20850
1 2 20888->20850
4 1 20961->20960
4 2 20961->20960
You can add route_order to the subset argument to get the output you want.
In [8]: df.drop_duplicates(subset=['StartEnd', 'route_order'], keep='first')
Out[8]:
line_group route_order StartEnd
0 1 1 20888->20850
1 1 2 20888->20850
4 4 1 20961->20960
5 4 2 20961->20960
You can use groupby.first():
df.groupby(["route_order", "StartEnd"], as_index=False).first()
output:
route_order StartEnd line_group
0 1 20888->20850 1
1 1 20961->20960 4
2 2 20888->20850 1
3 2 20961->20960 4
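Another equivalent option, sketched here rather than taken from the answers above, is groupby().transform('first'), which keeps the original row order and all route_order rows:
# Keep every row whose line_group equals the first line_group
# seen for its StartEnd value
first = df.groupby('StartEnd')['line_group'].transform('first')
filtered_df = df[df['line_group'] == first]
print(filtered_df)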

Dropping rows in pandas with .index

I came across the below line of code, which gives an error when '.index' is not present in it.
print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))
What is the purpose of '.index' while using drop in pandas?
As explained in the documentation, you can use drop with index:
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
df.drop([0, 1]) # Here 0 and 1 are the index of the rows
Output:
A B C D
2 8 9 10 11
In this case it will drop the first 2 rows.
With .index in your example, you find the rows where Quantity == 0 and retrieve their index (which is then used just like in the documentation example).
Here is the documentation for the .drop() method:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
The .drop() method needs a labels parameter, which is a list of index labels (when axis=0, the default case) or column labels (when axis=1).
df[df['Quantity'] == 0] returns a DataFrame of the rows where Quantity is 0, but what we need are the index labels of those rows, so .index is needed.
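For comparison, the same result can usually be obtained with a boolean mask directly, avoiding the drop/.index round trip. A small sketch, assuming the index has no duplicate labels (so both forms select the same rows):
# Keep the rows where Quantity is nonzero, then rename the column
result = df[df['Quantity'] != 0].rename(columns={'Weight': 'Weight (oz.)'})
print(result)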

extracting numerical information from strings in a dataframe column

I've seen this done in Excel, but I'd like to split the SOP prefix and the number into different columns. It gets a little tricky since the formatting differs at times:
0 SOP-015641
1 SOP-007809
2 SOP018262
3 SOP-007802
4 SOP-007804
5 SOP-007807
Use the .str.extract() method:
In [8]: df[['a','b']] = df.pop('col').str.extract(r'(\D+)(\d+)', expand=True)
In [9]: df
Out[9]:
a b
0 SOP- 015641
1 SOP- 007809
2 SOP 018262
3 SOP- 007802
4 SOP- 007804
5 SOP- 007807
RegEx explained: (\D+) captures a run of non-digits (the prefix, including any hyphen) and (\d+) captures the run of digits that follows.
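If you want a cleaner split, a small follow-up sketch (the column names a and b come from the answer above; converting to int drops the leading zeros, which may or may not be what you want):
df['a'] = df['a'].str.rstrip('-')  # normalize 'SOP-' and 'SOP' to 'SOP'
df['b'] = df['b'].astype(int)      # '015641' -> 15641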

Pandas Multi-Column Boolean Indexing/Selection with Dict Generator

Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any indexes (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but would like to only return those rows meeting multiple criteria as defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to select out using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully you can see that this would be a really useful 'template' for operating on large DataFrames, where the boolean indexing can be defined in boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
p.s. If you would like to test this out, you will need to populate some of df's columns with text. The definition of df using random numbers was simply given as a starter, if required for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
Col001 Col002 Col003 Col004 Col005
0 2 0 2 3 1
1 0 1 0 1 3
2 0 1 1 0 3
3 3 1 0 2 1
4 1 2 3 1 0
Compared to your example, all columns are filled with ints of 0, 1, 2 or 3.
Let's define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want some columns, for the rows where some other columns all contain the value 2.
You can then get the result with:
df.loc[df[list(filt)].apply(lambda x: x.tolist() == list(filt.values()), axis=1), req]
(Note the list() calls: on Python 3, .keys() and .values() return views, and comparing a list against a values view is always False. Since Python 3.7 dicts keep insertion order, so the keys and values line up.)
In my case this is the result:
Col002 Col012 Col025 Col032 Col033
43 2 2 1 3 3
98 2 1 1 1 2
Let's check the filter columns for those rows:
df[list(filt)].iloc[[43,98]]
Col001 Col002 Col005
43 2 2 2
98 2 2 2
And some other (non-matching) rows:
df[list(filt)].iloc[[44,99]]
Col001 Col002 Col005
44 0 3 3
99 0 0 1
I'm starting to like Pandas more and more.
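To answer the literal question, the combined mask can also be built from the dict with a generator-style expression. A sketch using the question's boolean_index_dict and required_columns_list (with the string criteria there, it only matches once the relevant columns contain text):
import numpy as np
import pandas as pd

# One boolean Series per (column, value) pair, ANDed together
mask = np.logical_and.reduce(
    [df[col] == val for col, val in boolean_index_dict.items()]
)
result = df.loc[mask, required_columns_list].values

# An equivalent vectorized form: pandas aligns the Series index
# with the selected columns before comparing
mask2 = (df[list(boolean_index_dict)] == pd.Series(boolean_index_dict)).all(axis=1)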
