I have a problem and I do not know how to solve it.
I have two lists, which always have the same length:
max_values = [333,30,10]
min_values = [30,10,0]
Every index of the lists represents the cluster number for the range between the corresponding min and max value, so:
Index/Cluster 0: 0-10
Index/Cluster 1: 10-30
Index/Cluster 2: 30-333
Furthermore I have one dataframe as follows:
Dataframe
Within the df, I have a column called "AVG_MPH_AREA"
For each value it should be checked which cluster range it falls into. Afterwards, the "Cluster" column should be set to the corresponding list index. The old values should be dropped.
In this case there are 3 clusters, but there could also be more or fewer...
Any idea how to do this, or which functions to use?
I came up with a small function that could do the task:
max_values = [333,30,10]
min_values = [30,10,0]
Make a dictionary that contains the cluster number as key and a (min_value, max_value) tuple as value.
def temp_func(x):
    # constructing the dict inside so this func can be applied to the AVG_MPH_AREA column of the dataframe
    dt = {}
    cluster_list = list(zip(min_values, max_values))
    for i in range(len(cluster_list)):
        dt[i] = cluster_list[i]
    for key, value in dt.items():
        x = int(round(x))
        if x in range(value[0], value[1]):
            return key
Now apply the function to the AVG_MPH_AREA column
df["Cluster"] = df["AVG_MPH_AREA"].apply(temp_func)
Output:
   AVG_MPH_AREA  Cluster
0        10.770        1
1        10.770        1
2        10.780        1
3         5.780        2
4        24.960        1
5       267.865        0
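As a side note (my suggestion, not part of the answer above), pandas' built-in pd.cut can do the same binning in one vectorized step, assuming the ranges are contiguous as in the example; handling of values exactly on a boundary may differ slightly from the rounding loop above:
import pandas as pd
max_values = [333, 30, 10]
min_values = [30, 10, 0]
# ascending bin edges built from the two lists: [0, 10, 30, 333]
edges = sorted(set(min_values + max_values))
# labels follow the ascending bin order: 0-10 -> cluster 2, 10-30 -> cluster 1, 30-333 -> cluster 0
labels = [2, 1, 0]
df["Cluster"] = pd.cut(df["AVG_MPH_AREA"], bins=edges, labels=labels)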
I am trying to complete missing information in some rows of a column in a dataframe, using another dataframe. In the first df (dfPivote) I have two columns of interest, 'Entrega' and 'Transportador', the latter being the one with missing information. I have a second df (dfTransportadoEntregadoFaltante) with two columns of interest: 'EntregaBusqueda', which is the key to my other df, and 'Transportador', which holds the information missing from the other df. I have the following code, and it is not working. How could I solve this problem?
I would recommend using dataframe operations to fill in missing values. If I've followed your example code correctly, I think you're trying to do something like this:
import pandas as pd
import numpy as np
# Create fake data
# "dfPivote" dataframe with an empty string in the "Transportador" column:
dfPivote = pd.DataFrame({'Entrega':[1,2,3],'Transportador':['a','','c']})
# "dfTransportadoEntregadoFaltante" lookup dataframe
dfTransportadoEntregadoFaltante = pd.DataFrame({'EntregaBusqueda':[1,2,3], 'Transportador':['a','b','c']})
# 1. Replace empty strings in dfPivote['Transportador'] with np.nan values:
dfPivote['Transportador'] = dfPivote['Transportador'].apply(lambda x: np.nan if len(x)==0 else x)
# 2. Merge the two dataframes together on the "Entrega" and "EntregaBusqueda" columns respectively:
df = dfPivote.merge(dfTransportadoEntregadoFaltante, left_on='Entrega', right_on='EntregaBusqueda', how='left')
# Entrega Transportador_x EntregaBusqueda Transportador_y
# 1 a 1 a
# 2 NaN 2 b
# 3 c 3 c
# 3. Fill NaNs in "Transportador_x" column with corresponding values in "Transportador_y" column:
df['Transportador_x'] = df['Transportador_x'].fillna(df['Transportador_y'])
# Entrega Transportador_x EntregaBusqueda Transportador_y
# 1 a 1 a
# 2 b 2 b
# 3 c 3 c
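If you then want the frame back in its original shape, one optional clean-up step (my addition, not from the question) is to drop the helper columns and restore the original column name:
# 4. (Optional) drop the lookup columns and rename back to "Transportador":
df = df.drop(columns=['EntregaBusqueda', 'Transportador_y'])
df = df.rename(columns={'Transportador_x': 'Transportador'})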
I'm trying to count the number of ships in a column of a dataframe. In this case I'm trying to count the number of 77Hs. I can do it for individual elements, but actions on the whole column don't seem to work.
E.g. this works with an individual element in my dataframe:
df = pd.DataFrame({'Route':['Callais','Dover','Portsmouth'],'shipCode':[['77H','77G'],['77G'],['77H','77H']]})
df['shipCode'][2].count('77H')
But when I try to perform the action on every row using either
df['shipCode'].count('77H')
df['shipCode'].str.count('77H')
Both attempts fail; any help on how to code this would be much appreciated.
Thanks
What if you did something like this?
Assuming your initial data...
import pandas as pd
from collections import Counter
df = pd.DataFrame(df)  # where df is the dict of data shown in the question
you can generate a Counter for all of the elements in the lists in each row like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x))
output:
Route shipCode counts
0 Callais [77H, 77G] {'77H': 1, '77G': 1}
1 Dover [77G] {'77G': 1}
2 Portsmouth [77H, 77H] {'77H': 2}
or if you want one in particular, i.e. '77H', you can do something like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x)['77H'])
output:
Route shipCode counts
0 Callais [77H, 77G] 1
1 Dover [77G] 0
2 Portsmouth [77H, 77H] 2
or even this using the first method (full Counter in each row):
[count['77H'] for count in df['counts']]
output:
[1, 0, 2]
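If what is ultimately needed is the total number of '77H' across the whole column, the integer counts column from the second variant can simply be summed (a small follow-up sketch, assuming that version of df['counts']):
total_77h = df['counts'].sum()
print(total_77h)  # 3 for the example data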
The data frame has a shipCode column with a list of values.
First, produce a True or False value to identify rows that contain the string '77H' in the shipCode column.
> df['shipCode'].map(lambda val: val.count('77H')>0)
Now filter the data frame based on those True/False values obtained in the previous step.
> df[df['shipCode'].map(lambda val: val.count('77H')>0)]
Finally, get a count of all rows in the data frame where the shipCode list contains a value matching '77H', using the Python len function.
> len(df[df['shipCode'].map(lambda val: val.count('77H')>0)])
Another way that makes it easy to remember what's been analyzed is to create a column in the same data frame to store the True/False value. Then filter by the True/False values. It's really the same as above but a little prettier in my opinion.
> df['filter_column'] = df['shipCode'].map(lambda val: val.count('77H')>0)
> len(df[df['filter_column']])
Good luck and enjoy working with Python and Pandas to process your data!
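As a possible shorthand (my variation, not part of the answer above): since each cell holds a Python list, a plain membership test also works, and summing the resulting booleans gives the same row count:
# number of rows whose shipCode list contains '77H'
df['shipCode'].map(lambda lst: '77H' in lst).sum()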
I have a Pandas DataFrame df with many columns, of which one is:
col
---
abc:kk__LL-z12-1234-5678-kk__z
def:kk_A_LL-z12-1234-5678-kk_ss_z
abc:kk_AAA_LL-z12-5678-5678-keek_st_z
abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x
...
I am trying to fetch all records where col starts with abc: and has the first -num- between '1234' and '2345' (inclusive using a string search; the -num- parts are exactly 4 digits each).
In the case above, I'd return
col
---
abc:kk__LL-z12-1234-5678-kk__z
abc:kk_AA_LL-z12-2345-5678-ek__x
...
My current (working, I think) solution looks like:
df = df[df['col'].str.startswith('abc:')]
df = df[df['col'].str.extract(r'.*-(\d+)-(\d+)-.*')[0].ge('1234')]
df = df[df['col'].str.extract(r'.*-(\d+)-(\d+)-.*')[0].le('2345')]
What is a more idiomatic and efficient way to do this in Pandas?
Complex string operations are not as efficient as numeric calculations. So the following approach might be more efficient:
m1 = df['col'].str.startswith('abc')
m2 = pd.to_numeric(df['col'].str.split('-').str[2]).between(1234, 2345)
dfn = df[m1&m2]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
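One caveat worth noting (my addition, not part of the answer): if a row does not have a purely numeric token in that third position, pd.to_numeric will raise a ValueError. Passing errors='coerce' turns such tokens into NaN, which then simply fail the between check:
m2 = pd.to_numeric(df['col'].str.split('-').str[2], errors='coerce').between(1234, 2345)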
One way would be to use a regexp and an apply function. I find it easier to work with a regexp in a separate function than to crowd the pandas expression.
import pandas as pd
import re
def filter_rows(string):
    z = re.match(r"abc:.*-(\d+)-(\d+)-.*", string)
    if z:
        return 1234 <= int(z.groups()[0]) <= 2345
    else:
        return False
Then use the defined function to select rows
df.loc[df['col'].apply(filter_rows)]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
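If the bounds need to vary between runs, one possible variation (a sketch of my own, keeping the same greedy pattern) is to build the filter function in a closure:
import re

def make_filter(lo, hi):
    pattern = re.compile(r"abc:.*-(\d+)-(\d+)-.*")
    def filter_rows(string):
        z = pattern.match(string)
        return bool(z) and lo <= int(z.group(1)) <= hi
    return filter_rows

df.loc[df['col'].apply(make_filter(1234, 2345))]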
Another play on regex:
# string starts with abc, greedy search,
# then look for either 1234- or 2345-,
# then a 4-digit number and whatever else comes after
pattern = r'(^abc.*(?<=1234-|2345-)\d{4}.*)'
df.col.str.extract(pattern).dropna()
0
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
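If the goal is only to filter the rows rather than pull out the matched text, the same pattern could also be used as a boolean mask via str.match (a small variation, not from the answer):
mask = df.col.str.match(pattern)
df[mask]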
I am trying to build a subset of dataframes from a larger dataframe by searching for a string in the column headings.
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]
I am trying to search for the wellname in the cdf dataframe columns and put the columns which contain that wellname into a new dataframe named after the wellname.
I am able to build my new sub-dataframes, but the dataframes come up empty with size (0, 0), while cdf is (21973, 91).
well_cols also populates correctly as a list.
These are some of the cdf column headings. Each column has 20k rows of data.
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
I want to create a new dataframe for every heading that contains the "well", i.e. a new dataframe for all columns & data whose column names contain N1, another for N2, etc.
The new dataframes populate correctly inside the loop but disappear when the loop ends... here is a bit of the output of print(well):
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC this should be enough:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict={}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]
Dictionaries are usually the way to go if you want to populate something. In this case, then, if you input well_dict['N1'], you'll get your first dataframe, and so on.
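To sanity-check that each entry picked up the expected columns, a quick loop over the dict might look like this (illustrative only):
for name, sub_df in well_dict.items():
    print(name, sub_df.shape, list(sub_df.columns)[:3])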
Rebinding the loop variable does not modify the list you are iterating over. That is, here is what your code is doing, based on your example:
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
But at no point did you change the list, or store the new dataframes for that matter (although you would still have the last dataframe stored in well at the end of the final iteration).
IMO, it seems like storing the dataframes in a dict would be easier to use:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]
However, if you really want it in a list, you could do something like:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]
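After that loop the strings in wells have been replaced by the sub-dataframes, so for example wells[0] now holds the N1 columns:
print(wells[0].columns.tolist())  # column names containing 'N1'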
One way to approach the problem is to use pd.MultiIndex and Groupby.
You can construct a MultiIndex composed of the well identifier and the variable name. If you have df:
   N1_a  N1_b  N2_a  N2_b
1     2     2     3     4
2     7     8     9    10
You can use df.columns.str.split('_', expand=True) to parse the well identifier and the corresponding variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
Which returns:
  N1     N2
   a  b   a   b
0  2  2   3   4
1  7  8   9  10
Then you can transpose the data frame and groupby the MultiIndex level 0.
grouped = df.T.groupby(level=0)
To return a list of untransposed sub-data frames you can use:
wells = [group.T for _, group in grouped]
where wells[0] is:
  N1
   a  b
0  2  2
1  7  8
and wells[1] is:
  N2
   a   b
0  3   4
1  9  10
The last step is rather unnecessary because the data can be accessed from the grouped object grouped.
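For instance, a single well can be pulled straight from the grouped object and transposed back (a minimal sketch):
n1 = grouped.get_group('N1').T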
All together:
import pandas as pd
from io import StringIO
data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""
df = pd.read_csv(StringIO(data))
# Parse column names to add the well name as a MultiIndex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
# Group by well name
grouped = df.T.groupby(level=0)
# build list of sub-dataframes
wells = [group.T for _, group in grouped]
Using contains:
df.loc[:, df.columns.str.contains('|'.join(wells))]
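If the per-well split is still wanted, the same idea can be combined with a dict comprehension (my variation, not part of the answer):
well_dfs = {well: df.loc[:, df.columns.str.contains(well)] for well in wells}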