I have imported an Excel table using Pandas. The table contains four columns representing Nodes, X, Y and Z data. I used the following script:
import pandas as pd
SolidFixity = pd.read_excel('GeomData.xlsx', sheet_name='SurfaceFixitySolid')
What I would like to do next is filter this dataframe using the filter table to select the Node of interest. I used this command:
SolidFixity.filter(like = '797', axis = 'Nodes')
This did not work and threw the following error:
ValueError: No axis named Nodes for object type
I know there is an axis named Nodes because the following command:
In [17]: SolidFixity.axes
Outputs the following:
Out[17]:
[RangeIndex(start=0, stop=809, step=1),
Index(['Nodes', 'X', 'Y ', 'Z'], dtype='object')]
Nodes is right there, shimmering like the sun.
What am I doing wrong here?
It seems you need boolean indexing or query, with a mask built by str.contains for substring matches, or a comparison with '797' for an exact match:
SolidFixity = pd.DataFrame({'Nodes':['797','sds','797 dsd','800','s','79785'],
                            'X':[5,3,6,9,2,4]})
print (SolidFixity)
     Nodes  X
0      797  5
1      sds  3
2  797 dsd  6
3      800  9
4        s  2
5    79785  4

a = SolidFixity[SolidFixity.Nodes.str.contains('797')]
print (a)
     Nodes  X
0      797  5
2  797 dsd  6
5    79785  4

b = SolidFixity[SolidFixity.Nodes == '797']
print (b)
  Nodes  X
0   797  5

b = SolidFixity.query("Nodes == '797'")
print (b)
  Nodes  X
0   797  5
The filter function only accepts the following values for axis:
axis : int or string axis name
The axis to filter on. By default this is the info axis, index for Series, columns for DataFrame
and it returns matching columns (or index labels) via the parameters items, like and regex:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C797':[7,8,9,4,2,3],
                   '797':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
  797  A  B C797  E  F
0   1  a  4    7  5  a
1   3  b  5    8  3  a
2   5  c  4    9  6  a
3   7  d  5    4  9  b
4   1  e  5    2  2  b
5   0  f  4    3  4  b

a = df.filter(like = '797', axis = 1)
# same as
# a = df.filter(like = '797', axis = 'columns')
print (a)
  797 C797
0   1    7
1   3    8
2   5    9
3   7    4
4   1    2
5   0    3

c = df.filter(items = ['797'], axis = 1)
# same as
# c = df.filter(items = ['797'], axis = 'columns')
print (c)
  797
0   1
1   3
2   5
3   7
4   1
5   0
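If you specifically want DataFrame.filter here, note that it only ever matches axis labels (column names or index values), never the data inside a column, which is why axis='Nodes' cannot work. A possible sketch (assuming the Nodes values are strings) is to move Nodes into the index and filter along the rows:

a = SolidFixity.set_index('Nodes').filter(like='797', axis=0)
print (a)
         X
Nodes
797      5
797 dsd  6
79785    4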
I have this data:
df = pd.DataFrame({'X1':[0,5,4,8,9,0,7,6],
                   'X2':[4,1,3,5,6,2,3,3],
                   'X3':['A','A','B','B','B','C','C','C']})
I want to set X3 as the index, so I would do:
df = df.set_index('X3')
And the result is:
   X1 X2
X3
A   0  4
A   5  1
B   4  3
B   8  5
B   9  6
C   0  2
C   7  3
C   6  3
However, I'm looking to set the same index but as a multi-index format (even if it's not a multi-index). This is the expected result:
  X1 X2
A  0  4
   5  1
B  4  3
   8  5
   9  6
C  0  2
   7  3
   6  3
Is that possible?
EDIT
Answering the comments: the reason I want this is that I want to order df by the X1 values without losing the grouping by X3, so I can see the ordering within each X3 group. I can't do it with sort_values(['X3', 'X1'], ascending=[False, False]) because the primary sort key must be the maximum of each group, while keeping all rows of a group together: the first group shown should be the one containing the overall maximum of X1, together with the rest of its rows, then the group holding the second-highest maximum with its rows, and so on.
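As an aside, that ordering by itself can be produced with a helper column from groupby/transform; a minimal sketch, assuming X3 is still a regular column (grp_max is just a hypothetical helper name):

df['grp_max'] = df.groupby('X3')['X1'].transform('max')
df = df.sort_values(['grp_max', 'X3', 'X1'], ascending=[False, True, False]).drop(columns='grp_max')

Sorting by the group maximum first ranks the groups by their best X1 while keeping each X3 group contiguous.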
Not sure if this helps, but what if you first get rid of the repeated values in the 'X3' column and then set that as the index...
import pandas as pd

df = pd.DataFrame({'X1':[0,5,4,8,9,0,7,6],
                   'X2':[4,1,3,5,6,2,3,3],
                   'X3':['A','A','B','B','B','C','C','C']})

# blank out repeated labels (assumes each group's rows are contiguous)
lst_index = []
for x in df["X3"]:
    if x in lst_index:
        lst_index.append("")
    else:
        lst_index.append(x)

del df["X3"]  # delete this column if it is not required anymore
df.index = lst_index

# Output
  X1 X2
A  0  4
   5  1
B  4  3
   8  5
   9  6
C  0  2
   7  3
   6  3
Try:
df
  X1 X2 X3
0  0  4  A
1  5  1  A
2  4  3  B
3  8  5  B
4  9  6  B
5  0  2  C
6  7  3  C
7  6  3  C

df = df.set_index('X3')
# duplicated() marks every repeat of a label, so this also assumes contiguous groups
new_index = pd.Series([i if not j else '' for i, j in zip(df.index, df.index.duplicated())])
df = df.set_index(new_index)
  X1 X2
A  0  4
   5  1
B  4  3
   8  5
   9  6
C  0  2
   7  3
   6  3
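For completeness, an actual MultiIndex gives the same sparsified display, at the cost of keeping the original row numbers as a second level; a minimal sketch on the original df:

df = pd.DataFrame({'X1':[0,5,4,8,9,0,7,6],
                   'X2':[4,1,3,5,6,2,3,3],
                   'X3':['A','A','B','B','B','C','C','C']})
df2 = df.set_index('X3', append=True).swaplevel(0, 1)
print(df2)
     X1 X2
X3
A  0  0  4
   1  5  1
B  2  4  3
   3  8  5
   4  9  6
C  5  0  2
   6  7  3
   7  6  3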
I am trying to calculate the columns df['Value'] and df['Value_Compensed']. However, to do that, I need to consider the previous value of the row df['Value_Compensed']. In terms of my table:
The first row: all the values are 0.
The following rows: df['Remained'] = previous df['Value_Compensed']. Then df['Value'] = df['Initial_value'] + df['Remained']. Then df['Value_Compensed'] = df['Value'] - df['Compensation'].
...and so on...
I am struggling to pass the value of Value_Compensed from one row to the next. I tried the shift() function, but the values in df['Value_Compensed'] came out wrong: the shifted value is not static, it changes again after each row is computed, so it did not work. Any ideas?
Thanks.
Manuel.
You can use apply to create your customised operations. I've made a dummy dataset as you didn't provide the initial dataframe.
import numpy as np
import pandas as pd
from itertools import zip_longest

# dummy data
df = pd.DataFrame(np.random.randint(1, 10, (8, 5)),
                  columns=['compensation', 'initial_value',
                           'remained', 'value', 'value_compensed'])
df.loc[0] = 0, 0, 0, 0, 0
>>> print(df)
  compensation initial_value remained value value_compensed
0            0             0        0     0               0
1            2             9        1     9               7
2            1             4        9     8               3
3            3             4        5     7               6
4            3             2        5     5               6
5            9             1        5     2               4
6            4             5        9     8               2
7            1             6        9     6               8
Use apply with axis=1 for row-wise iteration, passing the initial dataframe as an extra argument; inside the function you can then reach the previous row via x.name - 1 and do your calculations. I'm not sure I fully understood the intended result, but you can adjust the individual column calculations in the function.
def f(x, data):
    if x.name == 0:
        return [0,] * data.shape[1]
    else:
        x_remained = data.loc[x.name - 1]['value_compensed']
        x_value = data.loc[x.name - 1]['initial_value'] + x_remained
        x_compensed = x_value - x['compensation']
        return [x['compensation'], x['initial_value'], x_remained,
                x_value, x_compensed]

adj = df.apply(f, args=(df,), axis=1)
adj = pd.DataFrame.from_records(zip_longest(*adj.values), index=df.columns).T
>>> print(adj)
  compensation initial_value remained value value_compensed
0            0             0        0     0               0
1            5             9        0     0              -5
2            5             7        4    13               8
3            7             9        1     8               1
4            6             6        5    14               8
5            4             9        6    12               8
6            2             4        2    11               9
7            9             2        6    10               1
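One caveat: the recurrence in the question feeds each row's Value_Compensed into the next row, which apply over the original frame cannot express. A minimal sketch of a plain loop that carries the value forward, using the column names of the dummy data above:

remained, value, value_compensed = [0], [0], [0]
for i in range(1, len(df)):
    r = value_compensed[-1]                    # Remained = previous Value_Compensed
    v = df.loc[i, 'initial_value'] + r         # Value = Initial_value + Remained
    remained.append(r)
    value.append(v)
    value_compensed.append(v - df.loc[i, 'compensation'])  # Value_Compensed = Value - Compensation
df['remained'], df['value'], df['value_compensed'] = remained, value, value_compensed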
I have a dataframe with about 20k rows, with headings X, Y, Z, I, R, G, B (yes, it's a point cloud).
I want to create numerous sub-dataframes by sorting the data by column X and then grouping it into blocks of 100 rows.
Subsequently I would like to sort each sub-dataframe by the Y column and break it down further into blocks of 50 rows.
The end result should be a group of sub-dataframes of 50 rows each; I would then like to pick out all the rows with the highest Z value in each sub-dataframe and write them to a CSV file.
I have got this far with my code, but I am not sure how to continue:
import pandas as pd

headings = ['X', 'Y', 'Z', 'I', 'R', 'G', 'B']
data = pd.read_table('file.csv', sep=',', skiprows=[0], names=headings)
points = data.sort_values(by=['X'])
Considering a dummy dataframe of 1000 rows,
df.head()  # first 5 rows
   X  Y  Z  I  R  G  B
0  6  6  0  3  7  0  2
1  0  8  3  6  5  9  7
2  8  9  7  3  0  4  5
3  9  6  8  5  1  0  0
4  9  0  3  0  9  2  9
First, extract the highest value of Z from the dataframe:
import numpy as np

z_max = df['Z'].max()
df = df.sort_values('X')

# list of dataframes of 100 rows each (len(df) must be divisible by 100)
dfs_X = np.split(df, len(df) // 100)

collected = []
for idx, df_x in enumerate(dfs_X):
    dfs_X[idx] = df_x.sort_values('Y')
    dfs_Y = np.split(dfs_X[idx], len(dfs_X[idx]) // 50)
    for idy, df_y in enumerate(dfs_Y):
        rows = df_y[df_y['Z'] == z_max]
        collected.append(rows)

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat
results = pd.concat(collected)
results.head()
results will contain, from every sub-dataframe, the rows whose Z equals that maximum.
Output: first 5 rows
     X  Y  Z  I  R  G  B
541  0  0  9  0  3  6  2
610  0  2  9  3  0  7  6
133  0  4  9  3  3  9  9
731  0  5  9  5  1  0  2
629  0  5  9  0  9  7  7
Now, write this dataframe to CSV using results.to_csv().
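Note that z_max above is the global maximum of Z over the whole dataframe. If you instead want the highest Z within each 50-row block, as the question describes, compare against each block's own maximum inside the inner loop:

rows = df_y[df_y['Z'] == df_y['Z'].max()]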
I want to treat non-consecutive ids as different variables during groupby, so that I can return the first value of stamp and the sum of increment as a new dataframe. Here is sample input and output:
import pandas as pd
import numpy as np
df = pd.DataFrame([np.array(['a','a','a','b','c','b','b','a','a','a']),
                   np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']

df_result = pd.DataFrame([np.array(['a','b','c','b','a']),
                          np.array([1,4,5,6,8]), np.array([3,1,1,2,3])]).T
df_result.columns = ['id', 'stamp', 'increment_sum']
In [2]: df
Out[2]:
  id stamp increment
0  a     1         1
1  a     2         1
2  a     3         1
3  b     4         1
4  c     5         1
5  b     6         1
6  b     7         1
7  a     8         1
8  a     9         1
9  a    10         1

In [3]: df_result
Out[3]:
  id stamp increment_sum
0  a     1             3
1  b     4             1
2  c     5             1
3  b     6             2
4  a     8             3
I can accomplish this via
def get_result(d):
    total = d.increment.sum()
    stamp = d.stamp.min()
    name = d.id.max()
    return name, stamp, total

# idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()
result = list(zip(*df.groupby('key').apply(get_result)))
df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']
But I'm sure there must be a more elegant solution
Not optimal code, but it solves the problem:
> df_group = df.groupby('id')
We can't use id alone for the groupby, so we add a new column to group within id, based on whether the stamps are consecutive or not:
> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
  id stamp increment group_diff
0  a     1         1          0
1  a     2         1          0
2  a     3         1          0
3  b     4         1          0
4  c     5         1          0
5  b     6         1          2
6  b     7         1          2
7  a     8         1          5
8  a     9         1          5
9  a    10         1          5
Now we can use the new column group_diff for the secondary grouping. A sort on stamp is added at the end, as suggested in the comments, to get the expected order (sort_values replaces the long-removed DataFrame.sort):
> df.groupby(['id','group_diff']).agg({'increment': 'sum', 'stamp': 'first'}).reset_index()[['id', 'stamp', 'increment']].sort_values('stamp')
  id stamp increment
0  a     1         3
2  b     4         1
4  c     5         1
3  b     6         2
1  a     8         3
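For comparison, a more compact sketch that combines the shift/cumsum key from the question with named aggregation (this assumes pandas >= 0.25; grp is just a hypothetical name for the run key):

key = (df['id'] != df['id'].shift()).cumsum().rename('grp')
out = (df.groupby(key, sort=False)
         .agg(id=('id', 'first'), stamp=('stamp', 'first'), increment_sum=('increment', 'sum'))
         .reset_index(drop=True))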
I have got a pd.DataFrame:
    Time Value
a 1    1     1
  2    2     5
  3    5     7
b 1    1     5
  2    2     9
  3   10    11
I want to multiply the column Value with the column Time - Time(t-1) and write the result to a column Product, starting with row b, but separately for each top-level index.
For example, Product('1','b') should be (Time('1','b') - Time('1','a')) * Value('1','b'). To do this, I would need a "shifted" version of column Time "starting" at row b, so that I could do df["Product"] = (df["Time"].shifted - df["Time"]) * df["Value"]. The result should look like this:
    Time Value Product
a 1    1     1       0
  2    2     5       5
  3    5     7      21
b 1    1     5       0
  2    2     9       9
  3   10    11      88
This should do it:
>>> time_shifted = df['Time'].groupby(level=0).apply(lambda x: x.shift())
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
    Time Value Product
a 1    1     1       0
  2    2     5       5
  3    5     7      21
b 1    1     5       0
  2    2     9       9
  3   10    11      88
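The same per-group shift is also available directly on the groupby object, so the lambda is optional:

time_shifted = df['Time'].groupby(level=0).shift()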
Hey this should do what you need it to. Comment if I missed anything.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Time':[1,2,5,1,2,10], 'Value':[1,5,7,5,9,11]},
                  index=[['a','a','a','b','b','b'],[1,2,3,1,2,3]])

def product(x):
    x['Product'] = (x['Time'] - x.shift()['Time']) * x['Value']
    return x

df = df.groupby(level=0).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print(df)
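For reference, the same Product column can be built without apply, using a per-group diff that respects the group boundaries (a sketch on the df defined above):

df['Product'] = (df.groupby(level=0)['Time'].diff() * df['Value']).fillna(0)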