How do I compare adjacent cells in excel using Python? - python

I have two columns and I am trying to return the total count of rows where both adjacent cells within the two columns are identical. I am trying to iterate through each row of two columns and compare each item in the first column with its adjacent item in the second column
i.e.
A | B
1 | 1
2 | 3
4 | 4
returns 2 for there are 2 pairs that are identical.
My two columns are Q and R, so far:
import openpyxl
excel_document = openpyxl.load_workbook('example.xlsx')
sheet = excel_document.get_sheet_by_name('Page 1')
created_closed = sheet['Q2':'R1844']
count = 0
for cell in column:
if Q[2] == R[2]: #something along the lines of this
count += 1

import pandas as pd
df = pd.read_excel('example.xlsx')
df = df[df['A'] == df['B']]
print (df.shape[0])

The answer that comes to mind is:
count = created_closed[created_closed['Q']==created_closed['R']].shape[0]
No for loop required because pandas takes care of that.

A - Try using numpy and pandas.
1- Find all intersections: This will return an array with the elements that were common in both columns
from pandas import read_excel
import numpy as np
df = read_excel('excel_data.xlsx', names=['A','B'], header=None)
np.intersect1d(df['A'], df['B'])
2- now count the length of the array
B- read each column and save to, two separate dictionary with key being position and value being the value at that position. Compare the two dictionaries and keep track of matches.
something like this, I am traveling right now so I can no test.
count = 0
col1 = {1:'a', 2:'b', 3:'c'}
col2= {1:'g', 2:'b', 3:'v'}
for key in col1.keys():
if col1[key] == col2[key]:
count = count + 1

Related

Manipulate row values based on lists

i have actually a problem and I do not know how to solve it.
I have two lists, which have always the same lengths:
max_values = [333,30,10]
min_values = [30,10,0]
every index of the lists represents the cluster number of a range of the max and the min values, so:
Index/Cluster 0: 0-10
Index/Cluster 1: 10-30
Index/Cluster 2: 30-333
Furthermore I have one dataframe as follows:
Dataframe
Within the df, I have a column called "AVG_MPH_AREA"
It should be checked between which cluster range the value is. After the "Cluster" column should be set to the correct index of the list. The old values should be dropped.
In this case it's a list of 3 clusters, but it could also be more or less...
Any idea how to switch that or with which functions?
Came up with a small function that could do the task
max_values = [333,30,10]
min_values = [30,10,0]
Make a dictionary that contains Cluster_num as key and (min_values, max_values) as value.
def temp_func(x):
# constructing the dict inside to apply this func to AVG_MPH_AREA column in dataframe
dt = {}
cluster_list=list(zip(min_values, max_values))
for i in range(len(cluster_list)):
dt[i] = cluster_list[i]
for key, value in dt.items():
x = int(round(x))
if x in list(range(value[0], value[1])):
return key
else:
continue
Now apply the function to the AVG_MPH_AREA column
df["Cluster"] = df["AVG_MPH_AREA"].apply(temp_func)
Output:
AVG_MPH_AREA cluster
0 10.770 1
1 10.770 1
2 10.780 1
3 5.780 2
4 24.960 1
5 267.865 0

How to do a double bucle to complete missing information using different dataframe

I am trying to complete missing information in some rows from a column in a dataframe, using another dataframe. I have in the first df(dfPivote), two columns of interest 'Entrega' and 'Transportador' which is the one with missing information. I have a second df (dfTransportadoEntregadoFaltante) with two columns of interest 'EntregaBusqueda' which is the key to my other df, and 'Transportador' with the information missing from the other df. I have the following code, and it is not working. How could I solve this problem?
I would recommend using dataframe operations to fill in missing values. If I've followed your example code correctly, I think you're trying to do something like this:
import pandas as pd
import numpy as np
# Create fake data
# "dfPivote" dataframe with an empty string in the "Transportador" column:
dfPivote = pd.DataFrame({'Entrega':[1,2,3],'Transportador':['a','','c']})
# "dfTransportadoEntregadoFaltante" lookup dataframe
dfTransportadoEntregadoFaltante = pd.DataFrame({'EntregaBusqueda':[1,2,3], 'Transportador':['a','b','c']})
# 1. Replace empty strings in dfPivote['Transportador'] with np.nan values:
dfPivote['Transportador'] = dfPivote['Transportador'].apply(lambda x: np.nan if len(x)==0 else x)
# 2. Merge the two dataframes together on the "Entrega" and "EntregaBusqueda" columns respectively:
df = dfPivote.merge(dfTransportadoEntregadoFaltante, left_on='Entrega', right_on='EntregaBusqueda', how='left')
# Entrega Transportador_x EntregaBusqueda Transportador_y
# 1 a 1 a
# 2 NaN 2 b
# 3 c 3 c
# 3. Fill NaNs in "Transportador_x" column with corresponding values in "Transportador_y" column:
df['Transportador_x'] = df['Transportador_x'].fillna(df['Transportador_y'])
# Entrega Transportador_x EntregaBusqueda Transportador_y
# 1 a 1 a
# 2 b 2 b
# 3 c 3 c

How can I count occurrences of a string in a dataframe in Python?

I'm trying to count the number of ships in a column of a dataframe. In this case I'm trying to count the number of 77Hs. I can do it for individual elements but actions on the whole column don't seem to work
E.g. This works with an individual element in my dataframe
df = pd.DataFrame({'Route':['Callais','Dover','Portsmouth'],'shipCode':[['77H','77G'],['77G'],['77H','77H']]})
df['shipCode'][2].count('77H')
But when I try and perform the action on every row using either
df['shipCode'].count('77H')
df['shipCode'].str.count('77H')
It fails with both attempts, any help on how to code this would be much appreciated
Thanks
what if you did something like this??
assuming your initial dictionary...
import pandas as pd
from collections import Counter
df = pd.DataFrame(df) #where df is the dictionary defined in OP
you can generate a Counter for all of the elements in the lists in each row like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x))
output:
Route shipCode counts
0 Callais [77H, 77G] {'77H': 1, '77G': 1}
1 Dover [77G] {'77G': 1}
2 Portsmouth [77H, 77H] {'77H': 2}
or if you want one in particular, i.e. '77H', you can do something like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x)['77H'])
output:
Route shipCode counts
0 Callais [77H, 77G] 1
1 Dover [77G] 0
2 Portsmouth [77H, 77H] 2
or even this using the first method (full Counter in each row):
[count['77H'] for count in df['counts']]
output:
[1, 0, 2]
The data frame has a shipcode column with a list of values.
First show a True or False value to identify rows that contain the string '77H' in the shipcode column.
> df['shipcode'].map(lambda val: val.count('77H')>0)
Now filter the data frame based on those True/False values obtained in the previous step.
> df[df['shipcode'].map(lambda val: val.count('77H')>0)]
Finally, get a count for all values in the data frame where the shipcode list contains a value matching '77H' using the python len method.
> len(df[df['shipcode'].map(lambda val: val.count('77H')>0)])
Another way that makes it easy to remember what's been analyzed is to create a column in the same data frame to store the True/False value. Then filter by the True/False values. It's really the same as above but a little prettier in my opinion.
> df['filter_column'] = df['shipcode'].map(lambda val: val.count('77H')>0)
> len(df[df['filter_column']])
Good luck and enjoy working with Python and Pandas to process your data!

How can I efficiently and idiomatically filter rows of PandasDF based on multiple StringMethods on a single column?

I have a Pandas DataFrame df with many columns, of which one is:
col
---
abc:kk__LL-z12-1234-5678-kk__z
def:kk_A_LL-z12-1234-5678-kk_ss_z
abc:kk_AAA_LL-z12-5678-5678-keek_st_z
abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x
...
I am trying to fetch all records where col starts with abc: and has the first -num- between '1234' and '2345' (inclusive using a string search; the -num- parts are exactly 4 digits each).
In the case above, I'd return
col
---
abc:kk__LL-z12-1234-5678-kk__z
abc:kk_AA_LL-z12-2345-5678-ek__x
...
My current (working, I think) solution looks like:
df = df[df['col'].str.startswith('abc:')]
df = df[df['col'].str.extract('.*-(\d+)-(\d+)-.*')[0].ge('1234')]
df = df[df['col'].str.extract('.*-(\d+)-(\d+)-.*')[0].le('2345')]
What is a more idiomatic and efficient way to do this in Pandas?
Complex string operations are not as efficient as numeric calculations. So the following approach might be more efficient:
m1 = df['col'].str.startswith('abc')
m2 = pd.to_numeric(df['col'].str.split('-').str[2]).between(1234, 2345)
dfn = df[m1&m2]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
One way would be to use regexp and apply function. I find it easier to play with regexp in a separate function than to crowd the pandas expression.
import pandas as pd
import re
def filter_rows(string):
z = re.match(r"abc:.*-(\d+)-(\d+)-.*", string)
if z:
return 1234 <= (int(z.groups()[0])) <= 2345
else:
return False
Then use the defined function to select rows
df.loc[df['col'].apply(filter_rows)]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
Another play on regex :
#string starts with abc,greedy search,
#then look for either 1234, or 2345,
#search on for 4 digit number and whatever else after
pattern = r'(^abc.*(?<=1234-|2345-)\d{4}.*)'
df.col.str.extract(pattern).dropna()
0
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x

How to fill pandas dataframes in a loop?

I am trying to build a subset of dataframes from a larger dataframe by searching for a string in the column headings.
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
wellname = well
well = pd.DataFrame()
well_cols = [col for col in cdf.columns if wellname in col]
well = cdf[well_cols]
I am trying to search for the wellname in the cdf dataframe columns and put those columns which contain that wellname into a new dataframe named the wellname.
I am able to build my new sub dataframes but the dataframes come up empty of size (0, 0) while cdf is (21973, 91).
well_cols also populates correctly as a list.
These are some of cdf column headings. Each column has 20k rows of data.
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
I want to create a new dataframe with every heading that contains the "well" IE a new dataframe for all columns & data with column name containing N1, another for N2 etc.
The New dataframes populate correctly when inside the loop but disappear when the loop breaks... a bit of the code output for print(well):
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC this should be enough:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict={}
for well in wells:
well_cols = [col for col in cdf.columns if well in col]
well_dict[well] = cdf[well_cols]
Dictionaries are usually the way to go if you want to populate something. In this case, then, if you input well_dict['N1'], you'll get your first dataframe, and so on.
The elements of an array are not mutable when iterating over it. That is, here's what it's doing based on your example:
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
But at no point did you change the array, or store the new dataframes for that matter (although you would still have the last dataframe stored in well at the end of the iteration).
IMO, it seems like storing the dataframes in a dict would be easier to use:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
well_cols = [col for col in cdf.columns if well in col]
well_dfs[well] = cdf[well_cols]
However, if you really want it in a list, you could do something like:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
well_cols = [col for col in cdf.columns if well in col]
wells[ix] = cdf[well_cols]
One way to approach the problem is to use pd.MultiIndex and Groupby.
You can add the construct a MultiIndex composed of well identifier and variable name. If you have df:
N1_a N1_b N2_a N2_b
1 2 2 3 4
2 7 8 9 10
You can use df.columns.str.split('_', expand=True) to parse the well identifer corresponding variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(1)
Which returns:
N1 N2
a b a b
0 2 2 3 4
1 7 8 9 10
Then you can transpose the data frame and groupby the MultiIndex level 0.
grouped = df.T.groupby(level=0)
To return a list of untransposed sub-data frames you can use:
wells = [group.T for _, group in grouped]
where wells[0] is:
N1
a b
0 2 2
1 7 8
and wells[1] is:
N2
a b
0 3 4
1 9 10
The last step is rather unnecessary because the data can be accessed from the grouped object grouped.
All together:
import pandas as pd
from io import StringIO
data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""
df = pd.read_csv(StringIO(data))
# Parse Column names to add well name to multiindex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(1)
# Group by well name
grouped = df.T.groupby(level=0)
#bulist list of sub dataframes
wells = [group.T for _, group in grouped]
Using contains
df[df.columns.str.contains('|'.join(wells))]

Categories