I am trying to build a set of sub-dataframes from a larger dataframe by searching for a string in the column headings.
import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']

for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]
I am trying to search the cdf dataframe columns for the wellname and put the columns that contain it into a new dataframe named after the well.
I am able to build my new sub-dataframes, but the dataframes come up empty with shape (0, 0), while cdf is (21973, 91).
well_cols also populates correctly as a list.
These are some of cdf's column headings. Each column has about 20k rows of data.
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
I want to create a new dataframe for every heading that contains the well name, i.e. one dataframe for all columns and data whose column names contain N1, another for N2, and so on.
The new dataframes populate correctly inside the loop but disappear when the loop breaks. Here is a bit of the code output from print(well):
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC this should be enough:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']

well_dict = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]
Dictionaries are usually the way to go when you want to collect results like this. Here, well_dict['N1'] gives you the first dataframe, and so on.
Reassigning the loop variable does not change the list you are iterating over. That is, here's what your code is doing, based on your example:
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
But at no point did you change the list, or store the new dataframes for that matter (although you would still have the last dataframe stored in well at the end of the iteration).
IMO, it seems like storing the dataframes in a dict would be easier to use:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']

well_dfs = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]
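For example (a quick check, assuming the columns described in the question), a single well's sub-dataframe can then be pulled out of the dict:
n1 = well_dfs['N1']            # all columns whose names contain 'N1'
print(n1.shape)                # roughly (21973, 10) for the data described above
print(n1.columns.tolist())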
However, if you really want it in a list, you could do something like:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']

for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]
One way to approach the problem is to use pd.MultiIndex and groupby.
You can construct a MultiIndex composed of the well identifier and the variable name. If you have df:
   N1_a  N1_b  N2_a  N2_b
1     2     2     3     4
2     7     8     9    10
You can use df.columns.str.split('_', expand=True) to parse the well identifier and the corresponding variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
Which returns:
  N1     N2
   a  b   a   b
0  2  2   3   4
1  7  8   9  10
Then you can transpose the data frame and groupby the MultiIndex level 0.
grouped = df.T.groupby(level=0)
To return a list of untransposed sub-data frames you can use:
wells = [group.T for _, group in grouped]
where wells[0] is:
N1
a b
0 2 2
1 7 8
and wells[1] is:
N2
a b
0 3 4
1 9 10
The last step is rather unnecessary, because the data can be accessed directly from the grouped object.
All together:
import pandas as pd
from io import StringIO

data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""

df = pd.read_csv(StringIO(data))

# Parse column names to add the well name as a MultiIndex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

# Group by well name
grouped = df.T.groupby(level=0)

# Build the list of untransposed sub-dataframes
wells = [group.T for _, group in grouped]
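As noted above, the list step is optional; a single well can also be pulled straight from grouped (a small sketch using the toy data):
n1 = grouped.get_group('N1').T   # the untransposed N1 sub-dataframe
print(n1)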
Using contains
df.loc[:, df.columns.str.contains('|'.join(wells))]
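If you want one sub-dataframe per well with the same idea, a dict comprehension built on str.contains would look something like this (a sketch assuming the cdf and wells from the question):
per_well = {w: cdf.loc[:, cdf.columns.str.contains(w)] for w in wells}
per_well['N1'].head()   # columns whose names contain 'N1'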
I'm trying to use pandas to append a blank row based on the values in the first column. When the first six characters in the first column don't match, I want an empty row between them (effectively creating groups). Here is an example of what the output could look like:
002446

002447-01
002447-02

002448
This is what I was able to put together thus far.
readie = pd.read_csv('title.csv')
i = 0
for row in readie:
    readie.append(row)
    i += 1
    if readie['column title'][i][0:5] != readie['column title'][i+1][0:5]:
        readie.append([])
When running this code, I get the following error message:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I believe there are other ways to do this, but I would like to use pandas if at all possible.
I'm using the approach from this answer.
Assuming that strings like '123456' and '123' are considered as not matching:
df_orig = pd.DataFrame(
    {'col': ['002446', '002447-01', '002447-02', '002448', '00244', '002448']}
)
df = df_orig.reset_index(drop=True) # reset your index
first_6 = df['col'].str.slice(stop=6)
mask = first_6 != first_6.shift(fill_value=first_6[0])
df.index = df.index + mask.cumsum()
df = df.reindex(range(df.index[-1] + 1))
print(df)
col
0 002446
1 NaN
2 002447-01
3 002447-02
4 NaN
5 002448
6 NaN
7 00244
8 NaN
9 002448
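If the separator rows should be truly blank rather than NaN, for instance when writing back out to a file, a small follow-up sketch (the output filename is just an example) could be:
df = df.fillna('')                      # turn the NaN separator rows into empty strings
df.to_csv('output.csv', index=False)    # hypothetical output file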
I am trying to fill in missing information in some rows of a column in a dataframe, using another dataframe. In the first df (dfPivote) I have two columns of interest, 'Entrega' and 'Transportador', the latter being the one with missing information. I have a second df (dfTransportadoEntregadoFaltante) with two columns of interest: 'EntregaBusqueda', which is the key to my other df, and 'Transportador', which holds the information missing from the other df. I have the following code, and it is not working. How could I solve this problem?
I would recommend using dataframe operations to fill in missing values. If I've followed your example code correctly, I think you're trying to do something like this:
import pandas as pd
import numpy as np
# Create fake data
# "dfPivote" dataframe with an empty string in the "Transportador" column:
dfPivote = pd.DataFrame({'Entrega':[1,2,3],'Transportador':['a','','c']})
# "dfTransportadoEntregadoFaltante" lookup dataframe
dfTransportadoEntregadoFaltante = pd.DataFrame({'EntregaBusqueda':[1,2,3], 'Transportador':['a','b','c']})
# 1. Replace empty strings in dfPivote['Transportador'] with np.nan values:
dfPivote['Transportador'] = dfPivote['Transportador'].apply(lambda x: np.nan if len(x)==0 else x)
# 2. Merge the two dataframes together on the "Entrega" and "EntregaBusqueda" columns respectively:
df = dfPivote.merge(dfTransportadoEntregadoFaltante, left_on='Entrega', right_on='EntregaBusqueda', how='left')
#  Entrega Transportador_x  EntregaBusqueda Transportador_y
#        1               a                1               a
#        2             NaN                2               b
#        3               c                3               c
# 3. Fill NaNs in "Transportador_x" column with corresponding values in "Transportador_y" column:
df['Transportador_x'] = df['Transportador_x'].fillna(df['Transportador_y'])
#  Entrega Transportador_x  EntregaBusqueda Transportador_y
#        1               a                1               a
#        2               b                2               b
#        3               c                3               c
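If you then want the result to look like the original dfPivote again, a final cleanup step (a sketch using the column names above) could be:
# 4. Keep only the original columns and restore the original column name:
df = (df.drop(columns=['EntregaBusqueda', 'Transportador_y'])
        .rename(columns={'Transportador_x': 'Transportador'}))
#  Entrega Transportador
#        1             a
#        2             b
#        3             c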
Let's consider a DataFrame that contains one row of 2 values for each day of January 2010:
from datetime import datetime as dt
import numpy as np
import pandas as pd

date_range = pd.date_range(dt(2010,1,1), dt(2010,1,31), freq='1D')
df = pd.DataFrame(data=np.random.rand(len(date_range), 2), index=date_range)
and another time series with sparser data and duplicate index values:
observations = pd.DataFrame(data=np.random.rand(7, 2),
                            index=(dt(2010,1,12), dt(2010,1,18), dt(2010,1,20), dt(2010,1,20),
                                   dt(2010,1,22), dt(2010,1,22), dt(2010,1,28)))
I split the first DataFrame df into a list of 5 DataFrames, each containing one week's worth of data from the original: df_weeks = [g for n, g in df.groupby(pd.TimeGrouper('W'))]
Now I would like to split the data of the second DataFrame by the same 5 weeks. i.e. that would mean in that specific case ending up with a variable obs_weeks containing 5 DataFrames spanning the same time range as df_weeks , 2 of them being empty.
I tried using reindex such as in this question: Python, Pandas: Use the GroupBy.groups description to apply it to another grouping
and Periods:
p1 = [x.to_period() for x in list(df.groupby(pd.TimeGrouper('W')).groups.keys())]
p1 = sorted(p1)

dfs = []
for p in p1:
    dff = observations.truncate(p.start_time, p.end_time)
    dfs.append(dff)
(see this question: Python, Pandas: Boolean Indexing Comparing DateTimeIndex to Period)
The problem is that if some values in the index of observations are duplicates (and this is the case), none of those methods works. I also tried changing the index of observations to a normal column and slicing on that column, but I received error messages as well.
You can achieve this by doing a simple filter:
p1 = [x.to_period() for x in list(df.groupby(pd.TimeGrouper('W')).groups.keys())]
p1 = sorted(p1)

dfs = []
for p in p1:
    dff = observations.ix[
        (observations.index >= p.start_time) &
        (observations.index < p.end_time)]
    dfs.append(dff)
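In recent pandas versions pd.TimeGrouper has been replaced by pd.Grouper and .ix has been removed, so a sketch of the same filter with the current API might look like:
p1 = sorted(x.to_period('W') for x in df.groupby(pd.Grouper(freq='W')).groups.keys())
dfs = [observations.loc[(observations.index >= p.start_time) &
                        (observations.index < p.end_time)]
       for p in p1]
# len(dfs) == 5; the weeks with no observations come back as empty DataFrames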
I would like to reorder the rows in a dataframe based on an external mapping. So for example if the list is (2,1,3) I want to move the first item in the old df to the second item in the new df. I thought my question was the same as this: How to reorder indexed rows based on a list in Pandas data frame but that solution is not working. Here's what I've tried:
a = list(sampleinfo.filename)
b = list(exprs.columns)
matchIndex2 = [a.index(x) for x in b]
(1)
sampleinfo2 = sampleinfo[matchIndex2,]
(2)
sampleinfo2 = sampleinfo
sampleinfo2.reindex(matchIndex2)
Neither solution errors out, but the order doesn't change - it's like I haven't done anything.
I am trying to make sure that the columns in exprs and the filename values of the rows in sampleinfo are in the same order. In the solution I found online, I see that I can sort the columns of exprs instead:
a = list(sampleinfo.filename)
b = list(exprs.columns)
matchIndex = [b.index(x) for x in a]
exprs = exprs[matchIndex]
But I'd like to be able to sort by row. How can I do this?
The dataframes I am working with are too large to paste, but here's the general scenario:
exprs:
a1 a2 a3 a4 a5
1 2 2 2 1
4 3 2 1 1
sampleinfo:
filename otherstuff
a1 fwsegs
a5 gsgers
a3 grsgs
a2 gsgs
a4 sgs
Here's a function to re-order rows using an external list that is tied to a particular column in the data frame:
def reorder(A, column, values):
    """Re-order data frame A based on a column (given in the parameter
    column, which must have unique values)."""
    if set(A[column]) != set(values):
        raise Exception("ERROR missing values for re-ordering")
    at_position = {}
    index = 0
    for v in A[column]:
        at_position[v] = index
        index += 1
    re_position = [at_position[v] for v in values]
    return A.iloc[re_position]
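Applied to the question's data, a usage sketch would be:
sampleinfo2 = reorder(sampleinfo, 'filename', list(exprs.columns))
# sampleinfo2['filename'] is now a1, a2, a3, a4, a5 -- the same order as exprs.columns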
It seems that applying functions to data frames is typically done with respect to Series (e.g. df.apply(my_fun)), so such functions see one row at a time. My question is whether one can get more flexibility in the following sense: for a data frame df, write a function my_fun(row) such that we can point to rows above or below the current row.
For example, start with the following:
def row_conditional(df, groupcol, appcol1, appcol2, newcol, sortcol, shift):
    """Input: df (dataframe): input data frame
       groupcol, appcol1, appcol2, sortcol (str): column names in df
       shift (int): integer to point to a row above or below current row
       Output: df with a newcol appended based on conditions
    """
    df[newcol] = ''  # fill new col with blank str
    list_results = []
    members = set(df[groupcol])
    for m in members:
        df_m = df[df[groupcol] == m].sort_values(sortcol, ascending=True)
        df_m = df_m.reset_index(drop=True)
        numrows_m = df_m.shape[0]
        for r in range(numrows_m):
            # CONDITIONS, based on rows above or below
            if (df_m.loc[r + shift, appcol1] > 0) and (df_m.loc[r - shift, appcol2] == 'False'):
                df_m.loc[r, newcol] = 'old'
            else:
                df_m.loc[r, newcol] = 'new'
        list_results.append(df_m)
    return pd.concat(list_results).reset_index(drop=True)
Then, I'd like to be able to re-write the above as:
def new_row_conditional(row, shift):
    """apply above conditions to row relative to row[shift, appcol1] and row[shift, appcol2]
    """
    return new value at df.loc[row, newcol]   # pseudocode
and finally execute:
df.apply(new_row_conditional)
Thoughts/Solutions with 'map' or 'transform' are also very welcome.
From an OO-approach, I might imagine a row of df to be treated as an object that has attributes i) a pointer to all rows above it and ii) a pointer to all rows below it. Then referencing row.above and row.below in order to assign the new value at df.loc[row, newcol]
One can always look into the enclosing execution frame:
import pandas
dataf = pandas.DataFrame({'a':(1,2,3), 'b':(4,5,6)})

import sys

def foo(roworcol):
    # current index along the axis
    axis_i = sys._getframe(1).f_locals['i']
    # data frame the function is applied to
    dataf = sys._getframe(1).f_locals['self']
    axis = sys._getframe(1).f_locals['axis']
    # number of elements along the chosen axis
    n = dataf.shape[(1, 0)[axis]]
    # print where we are
    print('index: %i - %i items before, %i items after' % (axis_i,
                                                           axis_i,
                                                           n - axis_i - 1))
Within the function foo there are:
roworcol, the current element out of the iteration
axis, the chosen axis
axis_i, the index along the chosen axis
dataf, the data frame
This is all that is needed to point before and after in the data frame.
>>> dataf.apply(foo, axis=1)
index: 0 - 0 items before, 2 items after
index: 1 - 1 items before, 1 items after
index: 2 - 2 items before, 0 items after
A complete implementation of the specific example you added in the comments would then be:
import pandas
import sys

df = pandas.DataFrame({'a':(1,2,3,4), 'b':(5,6,7,8)})

def bar(row, k):
    axis_i = sys._getframe(2).f_locals['i']
    # data frame the function is applied to
    dataf = sys._getframe(2).f_locals['self']
    axis = sys._getframe(2).f_locals['axis']
    # number of elements along the chosen axis
    n = dataf.shape[(1, 0)[axis]]
    if axis_i == 0 or axis_i == (n - 1):
        res = 0
    else:
        res = dataf['a'][axis_i - k] + dataf['b'][axis_i + k]
    return res
You'll note that whenever additional arguments are present in the signature of the mapped function, we need to jump 2 frames up.
>>> df.apply(bar, args=(1,), axis=1)
0 0
1 8
2 10
3 0
dtype: int64
You'll also note that the specific example you have provided can be solved by other, and possibly simpler, means. The solution above is very general in the sense that it lets you use map while jailbreaking from the row being mapped, but it may also violate assumptions about what map is doing and, for example, deprive you of the opportunity to parallelize easily by assuming independent computation on the rows.
Create duplicate data frames that are index shifted, and loop over their rows in parallel.
df_pre = df.copy()
df_pre.index -= 1

result = [fun(x1, x2) for x1, x2 in zip(df_pre.iterrows(), df.iterrows())]
This assumes you actually want everything from that row. You can of course do direct operations instead, for example:
result = df_pre['col'] - df['col']
Also, there are some standard processing functions built in, like diff, shift, cumsum, and cumprod, that do operate on adjacent rows, although the scope is limited.
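As a concrete illustration of the shift-based route, here is a small sketch that reproduces the bar example above with k = 1:
import pandas as pd

df = pd.DataFrame({'a': (1, 2, 3, 4), 'b': (5, 6, 7, 8)})

# a[i-1] + b[i+1], with 0 wherever a neighbour is missing (the first and last rows)
res = (df['a'].shift(1) + df['b'].shift(-1)).fillna(0).astype(int)
print(res)
# 0     0
# 1     8
# 2    10
# 3     0
# dtype: int64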