I actually have a problem and I don't know how to solve it.
I have two lists, which always have the same length:
max_values = [333,30,10]
min_values = [30,10,0]
Every index of the lists represents the cluster number of a range between the min and max values, so:
Index/Cluster 0: 0-10
Index/Cluster 1: 10-30
Index/Cluster 2: 30-333
Furthermore, I have a dataframe with a column called "AVG_MPH_AREA".
For each value it should be checked which cluster range it falls into; the "Cluster" column should then be set to the matching list index. The old values should be dropped.
In this case it's a list of 3 clusters, but there could also be more or fewer...
Any idea how to solve this, or which functions to use?
I came up with a small function that does the task:
max_values = [333,30,10]
min_values = [30,10,0]
Make a dictionary that maps the cluster number (key) to its (min_value, max_value) pair.
def temp_func(x):
    # construct the dict inside so this func can be applied to the AVG_MPH_AREA column
    dt = {}
    cluster_list = list(zip(min_values, max_values))
    for i in range(len(cluster_list)):
        dt[i] = cluster_list[i]
    x = int(round(x))
    for key, (low, high) in dt.items():
        # equivalent to `x in range(low, high)`, but without building a list
        if low <= x < high:
            return key
Now apply the function to the AVG_MPH_AREA column
df["Cluster"] = df["AVG_MPH_AREA"].apply(temp_func)
Output:
   AVG_MPH_AREA  Cluster
0        10.770        1
1        10.770        1
2        10.780        1
3         5.780        2
4        24.960        1
5       267.865        0
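As an aside, here is a vectorized sketch using pd.cut, assuming the cluster ranges are contiguous and non-overlapping (the bin edges and label order are derived from the two lists above):

import pandas as pd

# sorted unique bin edges from the contiguous ranges: [0, 10, 30, 333]
edges = sorted(set(min_values + max_values))
# map each bin (by its left edge) back to its index in the original lists: [2, 1, 0]
labels = [min_values.index(lo) for lo in edges[:-1]]
# right=False makes the bins [0, 10), [10, 30), [30, 333)
df["Cluster"] = pd.cut(df["AVG_MPH_AREA"], bins=edges, labels=labels, right=False)

pd.cut returns a categorical column; append .astype(int) if a plain integer dtype is needed.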
I would like to split each of the columns in my dataset.
The idea is to extract the number between the slashes and the string between the second "/" and "#", and put these values into new columns.
I tried something like this:
new_df = dane['1: Brandenburg'].str.split('/',1)
and then creating new columns from it. But I don't want to do this manually for all 60 columns.
first column
1: Brandenburg :
ES-NL-10096/1938/X1#hkzydzon.dk/6749
BE-BR-6986/3551/B1#oqk.bf/39927
PH-SA-39552610/2436/A1#venagi.hr/80578
PA-AE-59691/4881/X1#zhicksl.cl/25247
second column
2: Achon :
DE-JP-20082/2066/A2#qwier.cu/68849
NL-LK-02276/2136/A1#ozmdpfts.de/73198
OM-PH-313/3671/Z1#jtqy.ml/52408
AE-ID-9632/3806/C3#lhbt.ar/83484
etc., etc...
As I understand it, you want to extract two parts from each cell.
E.g. from ES-NL-10096/1938/X1#hkzydzon.dk/6749 there should be
extracted:
1938 - the number between slashes,
X1 - the string between the second slash and #.
To do this, you can run:
df.stack().str.extract(r'/(?P<num>\d+)/(?P<txt>[A-Z\d]+)#')\
.stack().unstack([1, 2])
You will get a MultiIndex on the columns:
top level - the name of the "source" column,
second level - num and txt - the 2 extracted "parts".
For your sample data, the result is:
1: Brandenburg 2: Achon
num txt num txt
0 1938 X1 2066 A2
1 3551 B1 2136 A1
2 2436 A1 3671 Z1
3 4881 X1 3806 C3
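If flat column names are easier to work with downstream, here is a sketch (res is a hypothetical name for the result above):

res = df.stack().str.extract(r'/(?P<num>\d+)/(?P<txt>[A-Z\d]+)#')\
    .stack().unstack([1, 2])
# join the two column levels into single strings, e.g. '1: Brandenburg_num'
res.columns = ['_'.join(pair) for pair in res.columns]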
You can use df.apply() to iterate over all the columns of your DataFrame and apply a given function. Here is an example:
def fn(col):
    # split each string on the first '/' only
    return col.str.split('/', n=1)

new_df = dane.apply(fn)
Note that apply() uses axis=0 by default, which passes each column to the function (axis=1 would pass rows instead). Hope this helps!
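A quick usage check, as a sketch with one made-up row per column mirroring the question's data:

import pandas as pd

dane = pd.DataFrame({
    '1: Brandenburg': ['ES-NL-10096/1938/X1#hkzydzon.dk/6749'],
    '2: Achon': ['DE-JP-20082/2066/A2#qwier.cu/68849'],
})
new_df = dane.apply(fn)
print(new_df.loc[0, '1: Brandenburg'])
# ['ES-NL-10096', '1938/X1#hkzydzon.dk/6749']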
I have a dataframe with one column containing many values. I want to stack all those values into one cell of the same or another dataframe.
column_df =
index voltage
0 5.143590
1 5.175285
2 5.231214
3 6.040188
4 7.776510
5 9.540277
6 11.476937
7 13.277916
8 15.088566
9 16.895921
10 18.701332
I want to stack the column values into a single dataframe cell. Finally, I want to achieve something like:
Expected output:
cell_df =
index voltage
0 [ 5.14359 , 5.175285, 5.231214, 6.040188, 7.77651 , 9.540277, 11.476937, 13.277916, 15.088566, 16.895921, 18.701332]
My code is:
cell_df = pd.DataFrame()
cell_df['voltage'][0] = np.array([column_df['voltage']])
Present output:
ValueError: setting an array element with a sequence.
You can cast the "voltage" Series to a list and use it in your cell_df constructor:
cell_df = pd.DataFrame({"voltage": [column_df["voltage"].tolist()]})
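A minimal check of the idea, with a few made-up numbers:

import pandas as pd

column_df = pd.DataFrame({"voltage": [5.14359, 5.175285, 5.231214]})
cell_df = pd.DataFrame({"voltage": [column_df["voltage"].tolist()]})
print(cell_df.loc[0, "voltage"])
# [5.14359, 5.175285, 5.231214]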
I am trying to build a set of sub-dataframes from a larger dataframe by searching for a string in the column headings.
import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']

for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]
I am trying to search for the wellname in the cdf dataframe columns and put the columns which contain that wellname into a new dataframe named after the wellname.
I am able to build my new sub-dataframes, but the dataframes come up empty with size (0, 0) while cdf is (21973, 91).
well_cols also populates correctly as a list.
These are some of the cdf column headings. Each column has 20k rows of data.
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
I want to create a new dataframe for every heading that contains the "well", i.e. one dataframe for all columns and data whose column name contains N1, another for N2, etc.
The new dataframes populate correctly inside the loop but disappear when the loop ends... a bit of the code output for print(well):
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC this should be enough:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict = {}

for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]
Dictionaries are usually the way to go when you want to populate something like this. Then, if you look up well_dict['N1'], you'll get your first dataframe, and so on.
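For example, a sketch of the lookup:

n1_df = well_dict['N1']   # only the columns whose names contain 'N1'
print(n1_df.shape)        # e.g. (21973, 10) for the data described above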
Reassigning the loop variable does not modify the list you are iterating over; it only rebinds the name. That is, here's what it's doing based on your example:
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
But at no point did you change the list, or store the new dataframes for that matter (although you would still have the last dataframe stored in well at the end of the iteration).
IMO, it seems like storing the dataframes in a dict would be easier to use:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}

for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]
However, if you really want it in a list, you could do something like:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']

for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]
One way to approach the problem is to use pd.MultiIndex and groupby.
You can construct a MultiIndex composed of the well identifier and the variable name. If you have df:
   N1_a  N1_b  N2_a  N2_b
1     2     2     3     4
2     7     8     9    10
You can use df.columns.str.split('_', expand=True) to parse the well identifier and the corresponding variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
Which returns:
  N1     N2
   a  b   a   b
0  2  2   3   4
1  7  8   9  10
Then you can transpose the data frame and groupby the MultiIndex level 0.
grouped = df.T.groupby(level=0)
To return a list of untransposed sub-data frames you can use:
wells = [group.T for _, group in grouped]
where wells[0] is:
  N1
   a  b
0  2  2
1  7  8
and wells[1] is:
  N2
   a   b
0  3   4
1  9  10
The last step is rather unnecessary, because the data can be accessed directly from the grouped object.
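For example, a sketch pulling the same sub-frame straight from the grouped object:

n1 = grouped.get_group('N1').T   # equivalent to wells[0] above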
All together:
import pandas as pd
from io import StringIO

data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""

# the extra leading field in each row is read as the index
df = pd.read_csv(StringIO(data))

# parse column names to add the well name as a MultiIndex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

# group by well name
grouped = df.T.groupby(level=0)

# build the list of sub-dataframes
wells = [group.T for _, group in grouped]
Using contains:
df.loc[:, df.columns.str.contains('|'.join(wells))]
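That selects all matching columns at once; to get one sub-frame per well with the same idea, a sketch:

well_dfs = {w: df.loc[:, df.columns.str.contains(w)] for w in wells}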
I have two columns and I am trying to return the total count of rows where the adjacent cells in the two columns are identical. I am trying to iterate through each row and compare each item in the first column with its adjacent item in the second column, i.e.
A | B
1 | 1
2 | 3
4 | 4
returns 2, since there are 2 identical pairs.
My two columns are Q and R, so far:
import openpyxl

excel_document = openpyxl.load_workbook('example.xlsx')
sheet = excel_document.get_sheet_by_name('Page 1')
created_closed = sheet['Q2':'R1844']

count = 0
for cell in column:
    if Q[2] == R[2]:  # something along the lines of this
        count += 1
import pandas as pd
df = pd.read_excel('example.xlsx')
df = df[df['A'] == df['B']]
print (df.shape[0])
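This keeps the rows where A equals B and counts them. A slightly more direct variant, as a sketch, skips building the filtered frame:

count = (df['A'] == df['B']).sum()   # True counts as 1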
Assuming created_closed is a pandas DataFrame with columns Q and R, the answer that comes to mind is:
count = created_closed[created_closed['Q'] == created_closed['R']].shape[0]
No for loop required because pandas takes care of that.
A - Try using numpy and pandas.
1 - Find all intersections. This will return an array with the elements that were common to both columns:
from pandas import read_excel
import numpy as np
df = read_excel('excel_data.xlsx', names=['A','B'], header=None)
np.intersect1d(df['A'], df['B'])
2- now count the length of the array
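For example, a sketch (note that np.intersect1d returns the sorted unique common values, regardless of row position):

count = len(np.intersect1d(df['A'], df['B']))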
B - Read each column and save them to two separate dictionaries, with the position as the key and the value at that position as the value. Compare the two dictionaries and keep track of matches.
Something like this (I am traveling right now, so I cannot test):
count = 0
col1 = {1: 'a', 2: 'b', 3: 'c'}
col2 = {1: 'g', 2: 'b', 3: 'v'}

for key in col1.keys():
    if col1[key] == col2[key]:
        count = count + 1
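Equivalently, as a one-line sketch:

count = sum(col1[key] == col2[key] for key in col1)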