I actually have a problem and I don't know how to solve it.
I have two lists, which always have the same length:
max_values = [333,30,10]
min_values = [30,10,0]
Every index of the lists represents the cluster number of a range between the min and max values, so:
Index/Cluster 0: 0-10
Index/Cluster 1: 10-30
Index/Cluster 2: 30-333
Furthermore, I have a dataframe with a column called "AVG_MPH_AREA".
For each value it should be checked which cluster range it falls into; the "Cluster" column should then be set to the matching list index. The old values should be dropped.
In this case it's a list of 3 clusters, but there could also be more or fewer...
Any idea how to solve this, or which functions to use?
I came up with a small function that does the task:
max_values = [333,30,10]
min_values = [30,10,0]
Make a dictionary that maps the cluster number (key) to its (min_value, max_value) pair.
def temp_func(x):
    # construct the dict inside so this func can be applied to the AVG_MPH_AREA column
    dt = {}
    cluster_list = list(zip(min_values, max_values))
    for i in range(len(cluster_list)):
        dt[i] = cluster_list[i]
    x = int(round(x))
    for key, (low, high) in dt.items():
        # equivalent to `x in range(low, high)`, but without building a list
        if low <= x < high:
            return key
Now apply the function to the AVG_MPH_AREA column
df["Cluster"] = df["AVG_MPH_AREA"].apply(temp_func)
Output:
   AVG_MPH_AREA  Cluster
0        10.770        1
1        10.770        1
2        10.780        1
3         5.780        2
4        24.960        1
5       267.865        0
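As an aside, here is a vectorized sketch using pd.cut, assuming the cluster ranges are contiguous and non-overlapping (the bin edges and label order are derived from the two lists above):

import pandas as pd

# sorted unique bin edges from the contiguous ranges: [0, 10, 30, 333]
edges = sorted(set(min_values + max_values))
# map each bin (by its left edge) back to its index in the original lists: [2, 1, 0]
labels = [min_values.index(lo) for lo in edges[:-1]]
# right=False makes the bins [0, 10), [10, 30), [30, 333)
df["Cluster"] = pd.cut(df["AVG_MPH_AREA"], bins=edges, labels=labels, right=False)

pd.cut returns a categorical column; append .astype(int) if a plain integer dtype is needed.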
I would like to split each of the columns in my dataset.
The idea is to extract the number between the slashes and the string between the second "/" and "#", and put these values into new columns.
I tried something like this:
new_df = dane['1: Brandenburg'].str.split('/',1)
and then creating new columns from it. But I don't want to do this manually for all 60 columns.
first column
1: Brandenburg :
ES-NL-10096/1938/X1#hkzydzon.dk/6749
BE-BR-6986/3551/B1#oqk.bf/39927
PH-SA-39552610/2436/A1#venagi.hr/80578
PA-AE-59691/4881/X1#zhicksl.cl/25247
second column
2: Achon :
DE-JP-20082/2066/A2#qwier.cu/68849
NL-LK-02276/2136/A1#ozmdpfts.de/73198
OM-PH-313/3671/Z1#jtqy.ml/52408
AE-ID-9632/3806/C3#lhbt.ar/83484
etc., etc...
As I understand it, you want to extract two parts from each cell.
E.g. from ES-NL-10096/1938/X1#hkzydzon.dk/6749 there should be
extracted:
1938 - the number between slashes,
X1 - the string between the second slash and #.
To do this, you can run:
df.stack().str.extract(r'/(?P<num>\d+)/(?P<txt>[A-Z\d]+)#')\
.stack().unstack([1, 2])
You will get a MultiIndex on the columns:
top level - the name of the "source" column,
second level - num and txt - the 2 extracted "parts".
For your sample data, the result is:
1: Brandenburg 2: Achon
num txt num txt
0 1938 X1 2066 A2
1 3551 B1 2136 A1
2 2436 A1 3671 Z1
3 4881 X1 3806 C3
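If flat column names are easier to work with downstream, here is a sketch (res is a hypothetical name for the result above):

res = df.stack().str.extract(r'/(?P<num>\d+)/(?P<txt>[A-Z\d]+)#')\
    .stack().unstack([1, 2])
# join the two column levels into single strings, e.g. '1: Brandenburg_num'
res.columns = ['_'.join(pair) for pair in res.columns]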
You can use df.apply() to iterate over all the columns of your DataFrame and apply a given function. Here is an example:
def fn(col):
    # split each string on the first '/' only
    return col.str.split('/', n=1)

new_df = dane.apply(fn)
Note that apply() uses axis=0 by default, which passes each column to the function (axis=1 would pass rows instead). Hope this helps!
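A quick usage check, as a sketch with one made-up row per column mirroring the question's data:

import pandas as pd

dane = pd.DataFrame({
    '1: Brandenburg': ['ES-NL-10096/1938/X1#hkzydzon.dk/6749'],
    '2: Achon': ['DE-JP-20082/2066/A2#qwier.cu/68849'],
})
new_df = dane.apply(fn)
print(new_df.loc[0, '1: Brandenburg'])
# ['ES-NL-10096', '1938/X1#hkzydzon.dk/6749']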
I have a dataframe with one column containing many values. I want to stack all those values into one cell of the same or another dataframe.
column_df =
index voltage
0 5.143590
1 5.175285
2 5.231214
3 6.040188
4 7.776510
5 9.540277
6 11.476937
7 13.277916
8 15.088566
9 16.895921
10 18.701332
I want to stack the column values into a single dataframe cell. Finally, I want to achieve something like:
Expected output:
cell_df =
index voltage
0 [ 5.14359 , 5.175285, 5.231214, 6.040188, 7.77651 , 9.540277, 11.476937, 13.277916, 15.088566, 16.895921, 18.701332]
My code is:
cell_df = pd.DataFrame()
cell_df['voltage'][0] = np.array([column_df['voltage']])
Present output:
ValueError: setting an array element with a sequence.
You can cast the "voltage" Series to a list and use it in your cell_df constructor:
cell_df = pd.DataFrame({"voltage": [column_df["voltage"].tolist()]})
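A minimal check of the idea, with a few made-up numbers:

import pandas as pd

column_df = pd.DataFrame({"voltage": [5.14359, 5.175285, 5.231214]})
cell_df = pd.DataFrame({"voltage": [column_df["voltage"].tolist()]})
print(cell_df.loc[0, "voltage"])
# [5.14359, 5.175285, 5.231214]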
I am trying to build a set of sub-dataframes from a larger dataframe by searching for a string in the column headings.
import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']

for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]
I am trying to search for the wellname in the cdf dataframe columns and put the columns which contain that wellname into a new dataframe named after the wellname.
I am able to build my new sub-dataframes, but the dataframes come up empty with size (0, 0) while cdf is (21973, 91).
well_cols also populates correctly as a list.
These are some of the cdf column headings. Each column has 20k rows of data.
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
I want to create a new dataframe for every heading that contains the "well", i.e. one dataframe for all columns and data whose column name contains N1, another for N2, etc.
The new dataframes populate correctly inside the loop but disappear when the loop ends... a bit of the code output for print(well):
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC this should be enough:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict = {}

for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]
Dictionaries are usually the way to go when you want to populate something like this. Then, if you look up well_dict['N1'], you'll get your first dataframe, and so on.
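For example, a sketch of the lookup:

n1_df = well_dict['N1']   # only the columns whose names contain 'N1'
print(n1_df.shape)        # e.g. (21973, 10) for the data described above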
Reassigning the loop variable does not modify the list you are iterating over; it only rebinds the name. That is, here's what it's doing based on your example:
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
But at no point did you change the list, or store the new dataframes for that matter (although you would still have the last dataframe stored in well at the end of the iteration).
IMO, it seems like storing the dataframes in a dict would be easier to use:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}

for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]
However, if you really want it in a list, you could do something like:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']

for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]
One way to approach the problem is to use pd.MultiIndex and groupby.
You can construct a MultiIndex composed of the well identifier and the variable name. If you have df:
   N1_a  N1_b  N2_a  N2_b
1     2     2     3     4
2     7     8     9    10
You can use df.columns.str.split('_', expand=True) to parse the well identifier and the corresponding variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
Which returns:
  N1     N2
   a  b   a   b
0  2  2   3   4
1  7  8   9  10
Then you can transpose the data frame and groupby the MultiIndex level 0.
grouped = df.T.groupby(level=0)
To return a list of untransposed sub-data frames you can use:
wells = [group.T for _, group in grouped]
where wells[0] is:
  N1
   a  b
0  2  2
1  7  8
and wells[1] is:
  N2
   a   b
0  3   4
1  9  10
The last step is rather unnecessary, because the data can be accessed directly from the grouped object.
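For example, a sketch pulling the same sub-frame straight from the grouped object:

n1 = grouped.get_group('N1').T   # equivalent to wells[0] above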
All together:
import pandas as pd
from io import StringIO

data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""

# the extra leading field in each row is read as the index
df = pd.read_csv(StringIO(data))

# parse column names to add the well name as a MultiIndex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

# group by well name
grouped = df.T.groupby(level=0)

# build the list of sub-dataframes
wells = [group.T for _, group in grouped]
Using contains:
df.loc[:, df.columns.str.contains('|'.join(wells))]
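That selects all matching columns at once; to get one sub-frame per well with the same idea, a sketch:

well_dfs = {w: df.loc[:, df.columns.str.contains(w)] for w in wells}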
I have two columns and I am trying to return the total count of rows where the adjacent cells in the two columns are identical. I am trying to iterate through each row and compare each item in the first column with its adjacent item in the second column, i.e.
A | B
1 | 1
2 | 3
4 | 4
returns 2, since there are 2 identical pairs.
My two columns are Q and R, so far:
import openpyxl

excel_document = openpyxl.load_workbook('example.xlsx')
sheet = excel_document.get_sheet_by_name('Page 1')
created_closed = sheet['Q2':'R1844']

count = 0
for cell in column:
    if Q[2] == R[2]:  # something along the lines of this
        count += 1
import pandas as pd
df = pd.read_excel('example.xlsx')
df = df[df['A'] == df['B']]
print (df.shape[0])
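This keeps the rows where A equals B and counts them. A slightly more direct variant, as a sketch, skips building the filtered frame:

count = (df['A'] == df['B']).sum()   # True counts as 1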
Assuming created_closed is a pandas DataFrame with columns Q and R, the answer that comes to mind is:
count = created_closed[created_closed['Q'] == created_closed['R']].shape[0]
No for loop required because pandas takes care of that.
A - Try using numpy and pandas.
1 - Find all intersections. This will return an array with the elements that were common to both columns:
from pandas import read_excel
import numpy as np
df = read_excel('excel_data.xlsx', names=['A','B'], header=None)
np.intersect1d(df['A'], df['B'])
2- now count the length of the array
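For example, a sketch (note that np.intersect1d returns the sorted unique common values, regardless of row position):

count = len(np.intersect1d(df['A'], df['B']))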
B - Read each column and save them to two separate dictionaries, with the position as the key and the value at that position as the value. Compare the two dictionaries and keep track of matches.
Something like this (I am traveling right now, so I cannot test):
count = 0
col1 = {1: 'a', 2: 'b', 3: 'c'}
col2 = {1: 'g', 2: 'b', 3: 'v'}

for key in col1.keys():
    if col1[key] == col2[key]:
        count = count + 1
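Equivalently, as a one-line sketch:

count = sum(col1[key] == col2[key] for key in col1)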