Pandas Dataframe: Select row where column contains X, in multiindex - python

So I am struggling with this data processing. I have a data file with
~100 rows × 24 columns.
I load the file with
data = pd.read_csv(fileName, header=[0,1])
df = pd.DataFrame(data=data)
Then I only want to work on a part of it so I select only those columns:
postSurvey = df[['QUESTNNR','PS01_01', 'PS02_01', 'PS02_02', 'PS03_01', 'PS03_02', 'PS03_03', 'PS04', 'PS05', 'PS06', 'PS07_01']]
Problem: Now I want to select the rows which contain 'PS' in 'QUESTNNR'.
I can create this "list" of True/False values, but when I try to use it I get:
onlyPS = postSurvey['QUESTNNR'] == 'PS'
postSurvey = postSurvey[onlyPS]
ValueError: cannot join with no level specified and no overlapping names
With this I get:
postSurvey.xs('PS', level='QUESTNNR')
AttributeError: 'RangeIndex' object has no attribute 'get_loc_level'
I have tried all sorts of solutions from Stack Overflow and other sources, but I still need help.
dataframe (the first two lines are the two header rows; the data rows follow, with many cells blank):
A B C D E F G H Q W E R T Y U I O P S J K L Z X C V N M A1 A2 A3 S4 F4 G5
ASDF1 ASDF2 ASDF3 ASDF4 ASDF5 ASDF6 ASDF7 ASDF8 ASDF9 ASDF10 ASDF11 ASDF12 ASDF13 ASDF14 ASDF15 ASDF16 ASDF17 ASDF18 ASDF19 ASDF20 ASDF21 ASDF22 ASDF23 ASDF24 ASDF25 ASDF26 ASDF27 ASDF28 ASDF29 ASDF30 ASDF31 ASDF32 ASDF33 ASDF34
138 PS interview eng date 10 2 5 7 2012 10 1 13 1 26 1 0 1 1
129 QB2 interview eng date 4 6 5 56,10,34,7,20 1 0 2 2
130 QC1 interview eng date 6 2 6 7 1 0 2 2
131 QD2 interview eng date 3 8 6 5,8,15 1 0 2 2
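Because the file is read with header=[0,1], the columns form a MultiIndex, so postSurvey['QUESTNNR'] returns a one-column DataFrame rather than a Series, and pandas cannot align that boolean frame with the rows, hence the ValueError. A minimal sketch of two ways around this (the squeeze and droplevel calls are suggestions, not from the original post; fileName is the question's own variable):
```python
import pandas as pd

df = pd.read_csv(fileName, header=[0, 1])  # two header rows -> MultiIndex columns

postSurvey = df[['QUESTNNR', 'PS01_01', 'PS02_01', 'PS02_02', 'PS03_01',
                 'PS03_02', 'PS03_03', 'PS04', 'PS05', 'PS06', 'PS07_01']]

# Option 1: collapse the single-column 'QUESTNNR' sub-frame to a Series,
# then build the boolean mask from that Series.
questnnr = postSurvey['QUESTNNR'].squeeze()
postSurvey = postSurvey[questnnr.str.contains('PS')]

# Option 2: drop the second header level entirely and compare as usual.
# df.columns = df.columns.droplevel(1)
# postSurvey = df[df['QUESTNNR'] == 'PS']
```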

Related

How to filter rows based on comparing another column's value

I want to produce a new dataframe from the old dataframe.
>>> old_df
id name value
0 aa A 10
1 bb B 20
2 aa C 7
3 cc D 30
4 aa E 25
5 bb F 12
6 dd G 100
I only want the row with the lowest value for each id.
>>> new_df
id name value
0 aa C 7
1 bb F 12
2 cc D 30
3 dd G 100
Can anyone help me??
This will find the row with the minimum value for each id, based on the value column:
old_df.loc[old_df.groupby('id')['value'].idxmin()]
Here is a working example:
idx = old_df.groupby(['id'])['value'].transform(min) == old_df['value']
new_df = old_df[idx].sort_values('id')
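For completeness, a minimal end-to-end sketch (constructing old_df from the sample above is my own addition) that produces the desired new_df with a clean index:
```python
import pandas as pd

old_df = pd.DataFrame({
    'id':    ['aa', 'bb', 'aa', 'cc', 'aa', 'bb', 'dd'],
    'name':  ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'value': [10, 20, 7, 30, 25, 12, 100],
})

# idxmin gives, per id, the index label of the row with the smallest value
new_df = (old_df.loc[old_df.groupby('id')['value'].idxmin()]
                .sort_values('id')
                .reset_index(drop=True))
print(new_df)
#    id name  value
# 0  aa    C      7
# 1  bb    F     12
# 2  cc    D     30
# 3  dd    G    100
```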

How to position elements in a table in Pandas

I'm using Pandas, and I would like to reposition elements within the columns. I currently have:
Type Label  Initial  2022  2023  Difference
APPS        A/B/C    500   469   31
BACS        B/C/D    5     3     2
CAPS        C/D/E    10    5     5
I would like the table to be displayed like this:
Type Label/Initial  2022  2023  Difference
APPS                500   469   31
A
B
C
BACS                5     3     2
B
C
D
CAPS                10    5     5
C
D
E
Join the Type Label and Initial columns (using DataFrame.pop to drop the Initial column), split by /, use DataFrame.explode, rename the column, and set an empty string for the repeated values:
s = (df['Type Label'] + '/' + df.pop('Initial')).str.split('/')
df = df.assign(**{'Type Label': s}).explode('Type Label').rename(columns={'Type Label': 'Type Label/Initial'})
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.index.to_series().duplicated(), '')
df = df.reset_index(drop=True)
print (df)
Type Label/Initial 2022 2023 Difference
0 APPS 500 469 31
1 A
2 B
3 C
4 BACS 5 3 2
5 B
6 C
7 D
8 CAPS 10 5 5
9 C
10 D
11 E
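For reference, the input frame used above can be rebuilt from the question's table like this (the dict literal itself is my own):
```python
import pandas as pd

df = pd.DataFrame({
    'Type Label': ['APPS', 'BACS', 'CAPS'],
    'Initial':    ['A/B/C', 'B/C/D', 'C/D/E'],
    '2022':       [500, 5, 10],
    '2023':       [469, 3, 5],
    'Difference': [31, 2, 5],
})
```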

pandas apply User defined function to grouped dataframe on multiple columns

I would like to apply a function f1 by group to a dataframe:
import pandas as pd
import numpy as np
data = np.array([['id1','id2','u','v0','v1'],
['A','A',10,1,7],
['A','A',10,2,8],
['A','B',20,3,9],
['B','A',10,4,10],
['B','B',30,5,11],
['B','B',30,6,12]])
z = pd.DataFrame(data = data[1:,:], columns=data[0,:])
def f1(u, v):
    return u * np.cumprod(v)
The result of the function depends on the column u and on the columns v0 or v1 (there can be thousands of v columns because I'm running a simulation over many paths).
The result should look like this:
id1 id2 new_v0 new_v1
0 A A 10 70
1 A A 20 560
2 A B 60 180
3 B A 40 100
4 B B 150 330
5 B B 900 3960
I tried for a start
output = z.groupby(['id1', 'id2']).apply(lambda x: f1(u = x.u,v =x.v0))
but I can't even get a result with just one column.
Thank you very much!
You can filter the column names starting with v into a list and pass them to groupby:
v_cols = z.columns[z.columns.str.startswith('v')].tolist()
z[['u']+v_cols] = z[['u']+v_cols].apply(pd.to_numeric)
out = z.assign(**z.groupby(['id1','id2'])[v_cols].cumprod()
.mul(z['u'],axis=0).add_prefix('new_'))
print(out)
id1 id2 u v0 v1 new_v0 new_v1
0 A A 10 1 7 10 70
1 A A 10 2 8 20 560
2 A B 20 3 9 60 180
3 B A 10 4 10 40 100
4 B B 30 5 11 150 330
5 B B 30 6 12 900 3960
The way you create your data frame makes the numeric columns object dtype, so we convert them first, then use groupby + cumprod:
z[['u','v0','v1']]=z[['u','v0','v1']].apply(pd.to_numeric)
s=z.groupby(['id1','id2'])[['v0','v1']].cumprod().mul(z['u'],0)
#z=z.join(s.add_prefix('New_'))
v0 v1
0 10 70
1 20 560
2 60 180
3 40 100
4 150 330
5 900 3960
If you want to handle more than two v columns, it's better not to reference them explicitly:
(
    z.apply(lambda x: pd.to_numeric(x, errors='ignore'))
     .groupby(['id1', 'id2'])
     .apply(lambda x: x.cumprod().mul(x.u.min()))
)
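As an aside, a small sketch (my own construction, not from either answer) that builds z with numeric dtypes from the start, so the pd.to_numeric step is unnecessary and the grouped cumprod can be applied directly:
```python
import pandas as pd

z = pd.DataFrame({
    'id1': ['A', 'A', 'A', 'B', 'B', 'B'],
    'id2': ['A', 'A', 'B', 'A', 'B', 'B'],
    'u':   [10, 10, 20, 10, 30, 30],
    'v0':  [1, 2, 3, 4, 5, 6],
    'v1':  [7, 8, 9, 10, 11, 12],
})

v_cols = [c for c in z.columns if c.startswith('v')]

# cumulative product per (id1, id2) group, scaled row-wise by u
new_v = z.groupby(['id1', 'id2'])[v_cols].cumprod().mul(z['u'], axis=0)
out = z[['id1', 'id2']].join(new_v.add_prefix('new_'))
print(out)
#   id1 id2  new_v0  new_v1
# 0   A   A      10      70
# 1   A   A      20     560
# 2   A   B      60     180
# 3   B   A      40     100
# 4   B   B     150     330
# 5   B   B     900    3960
```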

Equivalent of LIMIT and OFFSET of SQL in pandas?

I have a dataframe like this:
id type city
0 2 d H
1 7 c J
2 7 x Y
3 2 o G
4 6 i F
5 5 b E
6 6 v G
7 8 u L
8 1 g L
9 8 k U
I would like to get, using pandas, output similar to this SQL command:
select id,type
from df
order by type desc
limit 4
offset 2
The required result is:
id type
0 8 u
1 2 o
2 8 k
3 6 i
I tried to follow the official tutorial https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#top-n-rows-with-offset
df.nlargest(4+2, columns='type').tail(4)
But this fails, because nlargest only works on numeric columns and type here is a string column.
How to solve the problem?
UPDATE
import numpy as np
import pandas as pd
import pandasql as pdsql
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
df = pd.read_csv('http://ourairports.com/data/airports.csv')
q = '''
select id,type
from df
order by type desc
limit 4
offset 2
'''
print(pysqldf(q))
```
id type
0 6525 small_airport
1 322127 small_airport
2 6527 small_airport
3 6528 small_airport
```
Using pandas:
print(df.sort_values('type', ascending=False).iloc[2:2+4][['id','type']])
id type
43740 37023 small_airport
43739 37022 small_airport
24046 308281 small_airport
24047 309587 small_airport
Yes, use integer location: the iloc start index is the 'offset' and the stop index is the offset plus the 'limit':
df.sort_values('type', ascending=False).iloc[2:6]
Output:
id type city
7 8 u L
3 2 o G
9 8 k U
4 6 i F
And you can add reset_index to clean up the index:
print(df.sort_values('type', ascending=False).iloc[2:6].reset_index(drop=True))
Output:
id type city
0 8 u L
1 2 o G
2 8 k U
3 6 i F
Update: to match the pandasql result on the airports data, sort by type and by the original index:
df.index.name = 'index'
df[['id','type']].sort_values(['type','index'], ascending=[False,True]).iloc[2:6]
Output:
index id type
0 3 6525 small_airport
1 5 322127 small_airport
2 6 6527 small_airport
3 7 6528 small_airport
You could use sort_values with ascending=False, and use .loc to slice the result (having reset the index) with the rows and columns of interest:
offset = 2
limit = 4
(df.sort_values(by='type', ascending=False).reset_index(drop=True)
   .loc[offset : offset + limit - 1, ['id', 'type']])
id type
2 8 u
3 2 o
4 8 k
5 6 i
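If this pattern comes up repeatedly, it could be wrapped in a small helper (the name limit_offset and its signature are my own, not a pandas API):
```python
import pandas as pd

def limit_offset(df, by, limit, offset=0, ascending=False):
    """Mimic SQL's ORDER BY ... LIMIT ... OFFSET via sort_values and iloc."""
    return df.sort_values(by, ascending=ascending).iloc[offset:offset + limit]

# e.g. limit_offset(df, by='type', limit=4, offset=2)[['id', 'type']]
```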

Using Pandas to filter Excel table

I have imported an Excel table using Pandas. The table contains four columns representing Nodes, X, Y and Z data. I used the following script:
import pandas as pd
SolidFixity = pd.read_excel('GeomData.xlsx', sheetname = 'SurfaceFixitySolid')
What I would like to do next is filter this dataframe using the filter method to select the Node of interest. I used this command:
SolidFixity.filter(like = '797', axis = 'Nodes')
This did not work and threw the following error:
ValueError: No axis named Nodes for object type
I know there is an axis named Nodes because the following command:
In [17]: SolidFixity.axes
Outputs the following:
Out[17]:
[RangeIndex(start=0, stop=809, step=1),
Index(['Nodes', 'X', 'Y ', 'Z'], dtype='object')]
Nodes is right there, shimmering like the sun.
What am I doing wrong here?
It seems you need boolean indexing or query: build a mask with str.contains, or compare with '797' for an exact match:
SolidFixity = pd.DataFrame({'Nodes':['797','sds','797 dsd','800','s','79785'],
'X':[5,3,6,9,2,4]})
print (SolidFixity)
Nodes X
0 797 5
1 sds 3
2 797 dsd 6
3 800 9
4 s 2
5 79785 4
a = SolidFixity[SolidFixity.Nodes.str.contains('797')]
print (a)
Nodes X
0 797 5
2 797 dsd 6
5 79785 4
b = SolidFixity[SolidFixity.Nodes == '797']
print (b)
Nodes X
0 797 5
b = SolidFixity.query("Nodes =='797'")
print (b)
Nodes X
0 797 5
The filter function only accepts these values for axis:
axis : int or string axis name
The axis to filter on. By default this is the info axis, index for Series, columns for DataFrame
and it selects columns via the like, regex and items parameters:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C797':[7,8,9,4,2,3],
'797':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
797 A B C797 E F
0 1 a 4 7 5 a
1 3 b 5 8 3 a
2 5 c 4 9 6 a
3 7 d 5 4 9 b
4 1 e 5 2 2 b
5 0 f 4 3 4 b
a = df.filter(like = '797', axis = 1)
#same as
#a = df.filter(like = '797', axis = 'columns')
print (a)
797 C797
0 1 7
1 3 8
2 5 9
3 7 4
4 1 2
5 0 3
c = df.filter(items = ['797'], axis = 1)
#same as
#c = df.filter(items = ['797'], axis = 'columns')
print (c)
797
0 1
1 3
2 5
3 7
4 1
5 0
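Tying this back to the original Excel data, a hedged sketch (the file and column names come from the question; note that newer pandas versions spell the keyword sheet_name rather than sheetname):
```python
import pandas as pd

SolidFixity = pd.read_excel('GeomData.xlsx', sheet_name='SurfaceFixitySolid')

# Cast to string in case Nodes was read as integers, then keep rows mentioning 797.
node_rows = SolidFixity[SolidFixity['Nodes'].astype(str).str.contains('797')]
```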
