Equivalent of SQL's LIMIT and OFFSET in pandas? - python

I have a dataframe like this:
id type city
0 2 d H
1 7 c J
2 7 x Y
3 2 o G
4 6 i F
5 5 b E
6 6 v G
7 8 u L
8 1 g L
9 8 k U
I would like to get output similar to this SQL command using pandas:
select id,type
from df
order by type desc
limit 4
offset 2
The required result is:
id type
0 8 u
1 2 o
2 8 k
3 6 i
I tried to follow the official tutorial https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#top-n-rows-with-offset
df.nlargest(4+2, columns='type').tail(4)
But this fails, because nlargest only works on numeric columns and type here is a string.
How can I solve the problem?
UPDATE
import numpy as np
import pandas as pd
import pandasql as pdsql
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
df = pd.read_csv('http://ourairports.com/data/airports.csv')
q = '''
select id,type
from df
order by type desc
limit 4
offset 2
'''
print(pysqldf(q))
```
id type
0 6525 small_airport
1 322127 small_airport
2 6527 small_airport
3 6528 small_airport
```
Using pandas:
print(df.sort_values('type', ascending=False).iloc[2:2+4][['id','type']])
id type
43740 37023 small_airport
43739 37022 small_airport
24046 308281 small_airport
24047 309587 small_airport

Yes, use integer location: the iloc start index is the 'offset' and the end index is the offset plus the 'limit':
df.sort_values('type', ascending=False).iloc[2:6]
Output:
id type city
7 8 u L
3 2 o G
9 8 k U
4 6 i F
And you can add reset_index to clean up indexing.
print(df.sort_values('type', ascending=False).iloc[2:6].reset_index(drop=True))
Output:
id type city
0 8 u L
1 2 o G
2 8 k U
3 6 i F
Update: to reproduce the pandasql output (which breaks ties by original row order), sort by type and the original index:
df.index.name = 'index'
df[['id','type']].sort_values(['type','index'], ascending=[False,True]).iloc[2:6]
Output:
index id type
3 6525 small_airport
5 322127 small_airport
6 6527 small_airport
7 6528 small_airport

You could use sort_values with ascending=False, reset the index, and use .loc to slice out the rows and columns of interest (note that .loc slicing is inclusive at both ends, hence offset+limit-1):
offset = 2
limit = 4
(df.sort_values(by='type', ascending=False).reset_index(drop=True)
   .loc[offset : offset+limit-1, ['id','type']])
id type
2 8 u
3 2 o
4 8 k
5 6 i
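Combining the two answers, here is a minimal reusable sketch (the function name and defaults are mine, not from the thread): a stable sort plus iloc reproduces ORDER BY ... LIMIT ... OFFSET, including SQL-style tie handling.
```
import pandas as pd

def sql_limit_offset(df, order_by, limit, offset=0, ascending=False, columns=None):
    # Stable sort so ties keep their original order, as the pandasql output does.
    out = df.sort_values(order_by, ascending=ascending, kind='mergesort')
    out = out.iloc[offset:offset + limit]   # OFFSET, then LIMIT
    if columns is not None:
        out = out[columns]                  # SELECT columns
    return out.reset_index(drop=True)

# Usage, matching the question:
# sql_limit_offset(df, 'type', limit=4, offset=2, columns=['id', 'type'])
```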

Related

How to sort column by value counts in pandas

I want to sort a dataframe in pandas. I want to do it by sorting two columns by value counts, where one depends on the other. As seen in the image, I have achieved categorical sorting. However, I want the column 'category' to be sorted by value counts, and then the dataframe to be sorted again by 'beneficiary_name' within the same category.
This is the code I have written to achieve this till now.
data_category = data_category.sort_values(by=['category','beneficiary_name'], ascending=False)
Please help me figure this out. Thanks.
Inspired by this related question:
Create column of value_counts in Pandas dataframe
import pandas as pd
df = pd.DataFrame({'id': range(9), 'cat': list('ababaacdc'), 'benef': list('uuuuiiiii')})
print(df)
# id cat benef
# 0 0 a u
# 1 1 b u
# 2 2 a u
# 3 3 b u
# 4 4 a i
# 5 5 a i
# 6 6 c i
# 7 7 d i
# 8 8 c i
df['cat_count'] = df.groupby(['cat'])['id'].transform('count')
print(df)
# id cat benef cat_count
# 0 0 a u 4
# 1 1 b u 2
# 2 2 a u 4
# 3 3 b u 2
# 4 4 a i 4
# 5 5 a i 4
# 6 6 c i 2
# 7 7 d i 1
# 8 8 c i 2
df = df.sort_values(by=['cat_count', 'cat', 'benef'], ascending=False)
print(df)
# id cat benef cat_count
# 0 0 a u 4
# 2 2 a u 4
# 4 4 a i 4
# 5 5 a i 4
# 6 6 c i 2
# 8 8 c i 2
# 1 1 b u 2
# 3 3 b u 2
# 7 7 d i 1
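The same idea can be written as one chained expression, starting from the initial df; this is only a restatement of the code above, with the helper column dropped at the end:
```
out = (df.assign(cat_count=df.groupby('cat')['id'].transform('count'))
         .sort_values(['cat_count', 'cat', 'benef'], ascending=False)
         .drop(columns='cat_count'))   # same row order, without the extra column
```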

How to subtract a row value from the values of all rows of another dataframe

I have this two df's
df1:
lon lat
0 -60.7 -2.8333333333333335
1 -55.983333333333334 -2.4833333333333334
2 -51.06666666666667 -0.05
3 -66.96666666666667 -0.11666666666666667
4 -48.483333333333334 -1.3833333333333333
5 -54.71666666666667 -2.4333333333333336
6 -44.233333333333334 -2.6
7 -59.983333333333334 -3.15
df2:
lon lat
0 -24.109 -2.0035
1 -17.891 -1.70911
2 -14.5822 -1.7470700000000001
3 -12.8138 -1.72322
4 -14.0688 -1.5028700000000002
5 -13.8406 -1.44416
6 -12.1292 -0.671266
7 -13.8406 -0.8824270000000001
8 -15.12 -18.223
I want to subtract all values of df2['lat'] from each value of df1['lat'].
Something like this :
results0=df1.loc[0,'lat']-df2.loc[:,'lat']
results1=df1.loc[1,'lat']-df2.loc[:,'lat']
#etc etc....
So I tried this:
for i, j in zip(range(len(df1)), range(len(df2))):
    exec(f"result{i} = df1.loc[{i},'lat'] - df2.loc[{j},'lat']")
But it only gave me one result value for each result, instead of 8 values for each result.
I will appreciate any possible solution. Thanks!
You can create a list of Series:
L = [df1.loc[i,'lat']-df2['lat'] for i in df1.index]
Or you can use numpy broadcasting to build a new DataFrame:
arr = df1['lat'].to_numpy() - df2['lat'].to_numpy()[:, None]
df3 = pd.DataFrame(arr, index=df2.index, columns=df1.index)
print (df3)
0 1 2 3 4 5 \
0 -0.829833 -0.479833 1.953500 1.886833 0.620167 -0.429833
1 -1.124223 -0.774223 1.659110 1.592443 0.325777 -0.724223
2 -1.086263 -0.736263 1.697070 1.630403 0.363737 -0.686263
3 -1.110113 -0.760113 1.673220 1.606553 0.339887 -0.710113
4 -1.330463 -0.980463 1.452870 1.386203 0.119537 -0.930463
5 -1.389173 -1.039173 1.394160 1.327493 0.060827 -0.989173
6 -2.162067 -1.812067 0.621266 0.554599 -0.712067 -1.762067
7 -1.950906 -1.600906 0.832427 0.765760 -0.500906 -1.550906
8 15.389667 15.739667 18.173000 18.106333 16.839667 15.789667
6 7
0 -0.596500 -1.146500
1 -0.890890 -1.440890
2 -0.852930 -1.402930
3 -0.876780 -1.426780
4 -1.097130 -1.647130
5 -1.155840 -1.705840
6 -1.928734 -2.478734
7 -1.717573 -2.267573
8 15.623000 15.073000
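For completeness, a sketch of the same pairwise table built with np.subtract.outer; note it is transposed relative to df3 above (rows follow df1, columns follow df2), and df4 is a name of my choosing:
```
import numpy as np

arr = np.subtract.outer(df1['lat'].to_numpy(), df2['lat'].to_numpy())
df4 = pd.DataFrame(arr, index=df1.index, columns=df2.index)  # shape (8, 9)
```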
Since df1 has one row fewer than df2, you can also subtract elementwise after trimming df2 to df1's length:
df1['lat'] = df1['lat'] - df2.loc[:df1.shape[0]-1, 'lat']
output:
0 -0.829833
1 -0.774223
2 1.697070
3 1.606553
4 0.119537
5 -0.989173
6 -1.928734
7 -2.267573
Name: lat, dtype: float64
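The .loc trim matters because Series arithmetic aligns on index labels; a small illustration (not from the original answer, assuming the original df1 and df2) of what happens without it:
```
diff = df1['lat'] - df2['lat']  # aligns on labels 0..7; df2's extra label 8 yields NaN
print(diff)                     # the eight elementwise differences, plus NaN at index 8
```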

Pandas Dataframe: Select row where column contains X, in multiindex

So I am struggling with this data processing. I have a data file with ~100 rows * 24 columns.
I load the file with
data = pd.read_csv(fileName, header=[0,1])
df = pd.DataFrame(data=data)
Then I only want to work on a part of it so I select only those columns:
postSurvey = questionaireData[['QUESTNNR','PS01_01', 'PS02_01', 'PS02_02', 'PS03_01', 'PS03_02', 'PS03_03', 'PS04', 'PS05', 'PS06', 'PS07_01']]
Problem: now I want to select the rows which contain 'PS' in 'QUESTNNR'.
I can create this "list" of True/False values, but when I try to use it I get:
onlyPS = postSurvey['QUESTNNR'] == 'PS'
postSurvey = postSurvey[onlyPS]
ValueError: cannot join with no level specified and no overlapping names
With this I get:
postSurvey.xs('PS', level='QUESTNNR')
AttributeError: 'RangeIndex' object has no attribute 'get_loc_level'
I have tried all sorts of solutions from stackoverflow and other sources, but need help.
dataframe:
A B C D E F G H Q W E R T Y U I O P S J K L Z X C V N M A1 A2 A3 S4 F4 G5
ASDF1 ASDF2 ASDF3 ASDF4 ASDF5 ASDF6 ASDF7 ASDF8 ASDF9 ASDF10 ASDF11 ASDF12 ASDF13 ASDF14 ASDF15 ASDF16 ASDF17 ASDF18 ASDF19 ASDF20 ASDF21 ASDF22 ASDF23 ASDF24 ASDF25 ASDF26 ASDF27 ASDF28 ASDF29 ASDF30 ASDF31 ASDF32 ASDF33 ASDF34
138 PS interview eng date 10 2 5 7 2012 10 1 13 1 26 1 0 1 1
129 QB2 interview eng date 4 6 5 56,10,34,7,20 1 0 2 2
130 QC1 interview eng date 6 2 6 7 1 0 2 2
131 QD2 interview eng date 3 8 6 5,8,15 1 0 2 2
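No answer was captured here, but a likely cause (my reading, not from the thread): with header=[0,1] the columns form a MultiIndex, so postSurvey['QUESTNNR'] returns a one-column DataFrame, and comparing a DataFrame with 'PS' produces a boolean DataFrame, which cannot be used as a row mask. Squeezing it to a Series avoids the ValueError:
```
# Sketch: assumes 'QUESTNNR' is a level-0 column label with a single sub-column.
mask = postSurvey['QUESTNNR'].squeeze() == 'PS'  # boolean Series, one value per row
onlyPS = postSurvey[mask]
```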

Can pandas dataframe have dtype of list?

I'm new to pandas. I am processing a dataset where one of the columns is a string of pipe (|) separated values. Now I have a task to remove any text in this |-separated field that does not fulfil certain criteria.
My naive approach is to iterate the dataframe row by row, split the field into a list and validate it that way, then write the modified row back to the original dataframe. See this metasample:
for index, row in dataframe.iterrows():
    fixed = [x[:29] for x in row['field'].split('|')]
    dataframe.loc[index, 'field'] = "|".join(fixed)
Is there a better, and more importantly faster way to do this?
IIUC you can use:
dataframe = pd.DataFrame({'field': ['aasd|bbuu|cccc|ddde|e', 'ffff|gggg|hhhh|i|j', 'cccc|u|k'],
                          'G': [4, 5, 6]})
print (dataframe)
G field
0 4 aasd|bbuu|cccc|ddde|e
1 5 ffff|gggg|hhhh|i|j
2 6 cccc|u|k
print(dataframe.field.str.split('|', expand=True)
               .stack()
               .str[:2]  # change 2 to 29 for the real data
               .groupby(level=0)
               .apply('|'.join))
0 aa|bb|cc|dd|e
1 ff|gg|hh|i|j
2 cc|u|k
dtype: object
Another solution via list comprehension:
dataframe['new'] = pd.Series([[x[:2] for x in y] for y in dataframe.field.str.split('|')],
                             index=dataframe.index).apply('|'.join)
print (dataframe)
G field new
0 4 aasd|bbuu|cccc|ddde|e aa|bb|cc|dd|e
1 5 ffff|gggg|hhhh|i|j ff|gg|hh|i|j
2 6 cccc|u|k cc|u|k
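If the stack/groupby machinery is not needed, the same truncation can be sketched with a plain map over each cell (change 2 to 29 for the original data):
```
dataframe['new'] = dataframe['field'].map(
    lambda s: '|'.join(x[:2] for x in s.split('|')))
```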
dataframe = pd.DataFrame({'field': ['aasd|bbuu|cc|ddde|e', 'ffff|gggg|hhhh|i|j', 'cccc|u|k'],
                          'G': [4, 5, 6]})
print (dataframe)
G field
0 4 aasd|bbuu|cc|ddde|e
1 5 ffff|gggg|hhhh|i|j
2 6 cccc|u|k
If you need to filter out all values longer than 2 characters:
s = dataframe.field.str.split('|', expand=True).stack()
print (s)
0 0 aasd
1 bbuu
2 cc
3 ddde
4 e
1 0 ffff
1 gggg
2 hhhh
3 i
4 j
2 0 cccc
1 u
2 k
dtype: object
dataframe['new'] = s[s.str.len() < 3].groupby(level=0).apply('|'.join)
print (dataframe)
G field new
0 4 aasd|bbuu|cc|ddde|e cc|e
1 5 ffff|gggg|hhhh|i|j i|j
2 6 cccc|u|k u|k
Another solution:
dataframe['new'] = pd.Series([[x for x in y if len(x) < 3] for y in dataframe.field.str.split('|')],
                             index=dataframe.index).apply('|'.join)
print (dataframe)
G field new
0 4 aasd|bbuu|cc|ddde|e cc|e
1 5 ffff|gggg|hhhh|i|j i|j
2 6 cccc|u|k u|k

Python pandas, multindex, slicing

I have got a pd.DataFrame
Time Value
a 1 1 1
2 2 5
3 5 7
b 1 1 5
2 2 9
3 10 11
I want to multiply the column Value with the column Time - Time(t-1) and write the result to a column Product, starting with row b, but separately for each top-level index.
For example, Product('1','b') should be (Time('1','b') - Time('1','a')) * Value('1','b'). To do this, I would need a "shifted" version of the column Time "starting" at row b, so that I could do df["Product"] = (df["Time"].shifted - df["Time"]) * df["Value"]. The result should look like this:
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
This should do it:
>>> time_shifted = df['Time'].groupby(level=0).apply(lambda x: x.shift())
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
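The groupby/shift step can also be written with GroupBy.diff, which computes Time - Time(t-1) within each top-level group directly; a one-line sketch of the same result:
```
df['Product'] = (df.groupby(level=0)['Time'].diff() * df['Value']).fillna(0)
```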
Hey this should do what you need it to. Comment if I missed anything.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Time': [1, 2, 5, 1, 2, 10], 'Value': [1, 5, 7, 5, 9, 11]},
                  index=[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]])

def product(x):
    x['Product'] = (x['Time'] - x.shift()['Time']) * x['Value']
    return x
df = df.groupby(level =0).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print(df)
