Different ways of iterating through pandas DataFrame - python

I am currently working on a short pandas project. The project assessment keeps marking this task as incorrect for me even though the resulting list appears to be the same as when the provided correct code is used. Is my code wrong and it just happens to give the same results for this particular DataFrame?
My code:
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for t in netflix_movies_col_subset['genre']:
    if t == 'Children':
        colors.append('red')
    elif t == 'Documentaries':
        colors.append('blue')
    elif t == 'Stand-up':
        colors.append('green')
    else:
        colors.append('black')
# Inspect the first 10 values in your list
print(colors[:10])
Provided code:
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for lab, row in netflix_movies_col_subset.iterrows():
    if row['genre'] == 'Children':
        colors.append('red')
    elif row['genre'] == 'Documentaries':
        colors.append('blue')
    elif row['genre'] == 'Stand-up':
        colors.append('green')
    else:
        colors.append('black')
# Inspect the first 10 values in your list
print(colors[0:10])

I've always been told that the best way to iterate over a dataframe row by row is NOT TO DO IT.
In your case, you could very nicely use df.ne().
First create a dataframe that holds all genres (df_genres),
then use
netflix_movies_col_subset['genre'].ne(df_genres, axis=0)
This should create a dataframe with a row for every movie and a column for every genre. If a certain movie is a documentary, the value in the Documentaries column would be False and the values in all other columns would be True (use .eq() instead if you want it the other way around).
This method is orders of magnitude faster than iterating with multiple if statements.
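As a concrete loop-free sketch of the same idea for the colors task (this uses Series.map rather than .ne(), purely as an illustration; the column and category names are taken from the question):
import pandas as pd

# minimal sketch, assuming netflix_movies_col_subset is a DataFrame with a 'genre' column
genre_to_color = {'Children': 'red', 'Documentaries': 'blue', 'Stand-up': 'green'}

colors = (netflix_movies_col_subset['genre']
          .map(genre_to_color)   # known genres get their color, everything else becomes NaN
          .fillna('black')       # default color for every other genre
          .tolist())

print(colors[:10])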

Does this help? I haven't tested it yet.
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for t in netflix_movies_col_subset['genre']:
    if t == 'Children':
        x = 'red'
    elif t == 'Documentaries':
        x = 'blue'
    elif t == 'Stand-up':
        x = 'green'
    else:
        x = 'black'
    colors.append(x)
# Inspect the first 10 values in your list
print(colors[:10])
Or you can use a match-case statement (Python 3.10+).
# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for t in netflix_movies_col_subset['genre']:
    match t:
        case 'Children':
            x = 'red'
        case 'Documentaries':
            x = 'blue'
        case 'Stand-up':
            x = 'green'
        case _:
            x = 'black'
    colors.append(x)
# Inspect the first 10 values in your list
print(colors[:10])

Related

Find list values in a column to delete odd ones out with Openpyxl

I have two workbooks and I'm looking to grab column A from both of them to compare the cell values and see if there is a discrepancy.
If column A (in workbook1) != column A (in workbook2), delete the value in workbook1.
Here's what I have so far:
book1_list = []
book2_list = []
tempList = []
column_name = 'Numbers'
skip_Head_of_anotherSheet = anotherSheet[2: anotherSheet.max_row]
skip_Head_of_other = sheets[2: sheets.max_row]
for val1 in skip_Head_of_other:
    book1_list.append(val1[0].value)
for val2 in skip_Head_of_anotherSheet:
    book2_list.append(val2[0].value)
for i in book1_list:
    for j in book2_list:
        if i == j:
            tempList.append(j)
            print(j)
Here is where I get stuck -
for temp in tempList:
    for pointValue in skip_Head_of_anotherSheet:
        if temp != pointValue[0].value:
            anotherSheet.cell(column=4, row=pointValue[1].row, value="YES")
        # else:
        #     if temp != pointValue[0].value:
        #         anotherSheet.cell(column=4, row=pointValue[1].row, value="YES")
        #     anotherSheet.delete_rows(pointValue[0])
        # anotherSheet.delete_rows(row[0].row, 1)
I also attempted to find the column by name:
for col in script.iter_cols():
    # see if the value of the first cell matches
    if col[0].value == column_value:
        # this is the column we want; this col is an iterable of cells
        for cell in col:
            # do something with the cell in this column here
            pass
I'm not quite sure I understand what you want to do but the following might help. When you want to check for membership in Python use dictionaries and sets.
source = wb1["sheet"]
comparison = wb2["sheet"]
# create dictionaries of the cells in the worksheets keyed by cell value
source_cells = {row[0].value: row[0] for row in source.iter_rows(min_row=2, max_col=1)}
comparison_cells = {row[0].value: row[0] for row in comparison.iter_rows(min_row=2, max_col=1)}
shared = source_cells.keys() & comparison_cells.keys()   # set of values in both sheets
missing = comparison_cells.keys() - source_cells.keys()  # set of values only in the other sheet
for value in shared:
    cell = source_cells[value]
    cell.offset(column=3).value = "YES"
to_remove = [comparison_cells[value].row for value in missing]  # list of rows to be removed
for r in sorted(to_remove, reverse=True):  # always remove rows from the bottom first
    comparison.delete_rows(r)
You'll probably need to adjust this to suit your needs but I hope it helps.
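For context, a minimal sketch of how wb1 and wb2 in the snippet above might be obtained (the filenames and the sheet name "sheet" here are assumptions, not taken from the question):
from openpyxl import load_workbook

# hypothetical filenames; substitute your actual workbook paths
wb1 = load_workbook("workbook1.xlsx")
wb2 = load_workbook("workbook2.xlsx")

source = wb1["sheet"]       # worksheet that gets marked with "YES"
comparison = wb2["sheet"]   # worksheet whose extra rows get deleted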
A dictionary solved the issue:
I turned the tempList into a tempDict like so:
comp = dict.fromkeys(tempList)
So now it returns a dictionary.
Then, instead of looping over tempList, I only looped over the sheet.
In the if statement I checked whether the value is in the dictionary.
for pointValue in skip_Head_of_anotherSheet:
    if pointValue[21].value in comp:
        # anotherSheet.cell(column=23, row=pointValue[21].row, value="YES")
        anotherSheet.delete_rows(pointValue[21].row, 1)
    if pointValue[21].value not in comp:
        # anotherSheet.cell(column=23, row=pointValue[21].row, value="NO")
        pass

pandas: calculate overlapping words between rows only if values in another column match (issue with multiple instances)

I have a dataframe that looks like the following, but with many rows:
import pandas as pd

data = {'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_call', 'order_taxi'],
        'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'order call', 'i would like a new taxi'],
        'key_words': [['need','hamburger'], ['want','sushi'], ['need','cab'], ['call','6'], ['call'], ['order','call'], ['new','taxi']]}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])
I have calculated the word overlap (set intersection) using the code below (not my solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
and modified the code given by @Amit Amola to compare overlapping words between every possible pair of rows and created a dataframe out of it:
overlapping_word_list = []
for val in list(combinations(range(len(data_new)), 2)):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1], data_new.iloc[val[1],1])}")
# creating an overlap dataframe
banking_overlapping_words_per_sent = DataFrame(overlapping_word_list, columns=['overlapping_list'])
@gold_cy's answer has helped me, and I made some changes to it to get the output I like:
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent, ['intent','key_words','Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = rows
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
The issue is that when there are more instances of the same intent, I run into the error:
ValueError: too many values to unpack (expected 2)
and I do not know how to handle that for the many more examples I have in my dataset.
Do you want this?
from itertools import combinations
from operator import itemgetter

items_to_consider = []
for item in list(combinations(zip(df.Sent.values, map(set, df.key_words.values)), 2)):
    keywords = list(map(itemgetter(1), item))
    intersect = keywords[0].intersection(keywords[1])
    if len(intersect) > 0:
        str_list = list(map(itemgetter(0), item))
        str_list.append(intersect)
        items_to_consider.append(str_list)

for i in items_to_consider:
    for item in i[2]:
        if item in i[0] and item in i[1]:
            print(f"Overlap of intent (order_food) for ({i[0]}) and ({i[1]}) is {item}")

Counting combinations in Dataframe create new Dataframe

So I have a dataframe called reactions_drugs,
and I want to create a table called new_r_d where I keep track of how often I see a symptom for a given medication.
Here is the code I have, but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0":
new_r_d = pd.DataFrame(columns=['drugname', 'reaction', 'count'])
for i in range(len(reactions_drugs)):
    name = reactions_drugs.drugname[i]
    drug_rec_act = reactions_drugs.drug_rec_act[i]
    for rec in drug_rec_act:
        row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
        if row == []:
            # create new row
            new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
        else:
            new_r_d.at[row, 'count'] += 1
Assuming the rows in your current reactions (drug_rec_act) column contain one string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter) and then utilize the explode() function and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
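As a rough illustration of what that produces, here is a tiny hedged example; the drug names and reactions are made up, and the only assumption is that each drug_rec_act cell holds a single comma-separated string wrapped in a list, as described above:
import pandas as pd

# hypothetical sample in the shape the answer assumes:
# one comma-separated string of reactions wrapped in a list, per row
df = pd.DataFrame({
    'drugname': ['drugA', 'drugB'],
    'drug_rec_act': [['nausea,headache,nausea'], ['rash']],
})

df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
print(result)
# roughly: drugA/nausea/2, drugA/headache/1, drugB/rash/1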

Randomization of a list with conditions using Pandas

I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created very sequential code, and that's actually my problem. My goal is to create a somewhat-automated script, probably including a for loop (which I've tried, unsuccessfully).
The main aim is to create a randomization loop which takes an original dataset looking like this:
[dataset screenshot]
From this dataset, rows are picked randomly one by one and saved to another Excel sheet. The point is that each selected row's values in the columns position01 and position02 must not match either of those two column values from the previous pick. That should eventually create an Excel sheet with randomized rows, where each row never includes the position values of the previous pick. So row 2 should not include any of the position01/position02 values of row 1, row 3 should not contain the values of row 2, etc. It should also iterate over the range of the list length, which is 0-11. The Excel output is also important, since I need the rest of the columns; I just need to shuffle the order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me unstuck. Thank you. Code below. (PS: I'm aware that there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row 'position01'/02 are columns headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which not including row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
In the end I used a solution provided by David Bridges (post from Sep 19, 2019) on the PsychoPy forum. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I've just adjusted the condition in for loop to my case like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer, and hopefully I did not spam this thread too much.
import itertools as it
import random
import pandas as pd

# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)), 2)]
df = pd.DataFrame(tmp1, columns=["position01", "position02"])
df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = df1.append(df.loc[i], ignore_index=True)
df = df.drop(index=i)
while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"] != val[0]) & (df["position01"] != val[1]) & (df["position02"] != val[0]) & (df["position02"] != val[1])]
    if tmp.empty:  # looped 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)
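One note on this sketch: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas each df1.append(...) line above would become a concat instead, e.g.:
df1 = pd.concat([df1, df.loc[[i]]], ignore_index=True)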

How to time-efficiently remove values next to 'NaN' values?

I'm trying to remove wrong values from my data (a series of 15 million values, 700 MB). The values to be removed are values next to 'nan' values, e.g.:
Series: /1/,nan,/2/,3,/4/,nan,nan,nan,/8/,9
Numbers surrounded by slashes, i.e. /1/, /2/, /4/, /8/, are the values which should be removed.
The problem is that it takes way too long to compute that with the following code that I have:
%%time
import numpy as np
import pandas as pd

# sample data
speed = np.random.uniform(0, 25, 15000000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
             'next_speed': next_speed}
df = pd.DataFrame(data_dict)

# calculate difference between the current speed and the next speed
list_of_differences = []
for i in df.index:
    difference = df.next_speed[i] - df.speed[i]
    list_of_differences.append(difference)
df['difference'] = list_of_differences

# add 'nan' to data in form of a string.
for i in range(len(df.difference)):
    # arbitrary condition
    if df.difference[i] < -2:
        df.difference[i] = 'nan'

#########################################
# THE TIME-INEFFICIENT LOOP
# remove wrong values before and after 'nan'.
for i in range(len(df)):
    # check if the value is a number to skip computations of the following "if" cases
    if not isinstance(df.difference[i], str):
        continue
    # case 1: where there's only one 'nan' surrounded by values.
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1] = 'wrong'
        df.difference[i+1] = 'wrong'
    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1] = 'wrong'
    # case 3: where next value is NOT 'nan': wrong, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1] = 'wrong'
How to make it more time-efficient?
This is still a work in progress for me. I knocked 100x off your dummy data size to get down to something I could stand to wait for.
I also added this code at the top of my version:
import time

current_milli_time = lambda: int(round(time.time() * 1000))

def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))
This just prints a string with a time-mark in front of it, to see what's taking so long.
With that done, in your 'difference' column computation, you can replace the manual list generation with a vector operation. This code:
df = pd.DataFrame(data_dict)
mark("Got DataFrame")

# calculate difference between the current speed and the next speed
list_of_differences = []
for i in df.index:
    difference = df.next_speed[i] - df.speed[i]
    list_of_differences.append(difference)
df['difference'] = list_of_differences
mark("difference 1")

df['difference2'] = df['next_speed'] - df['speed']
mark('difference 2')
print(df[:10])
Produces this output:
[1490943913.921] Got DataFrame
[1490943922.094] difference 1
[1490943922.096] difference 2
   next_speed      speed  difference  difference2
0   18.008314  20.182982   -2.174669    -2.174669
1   14.736095  18.008314   -3.272219    -3.272219
2    5.352993  14.736095   -9.383102    -9.383102
3    5.854199   5.352993    0.501206     0.501206
4    2.003826   5.854199   -3.850373    -3.850373
5   12.736061   2.003826   10.732236    10.732236
6    2.512623  12.736061  -10.223438   -10.223438
7   18.224716   2.512623   15.712093    15.712093
8   14.023848  18.224716   -4.200868    -4.200868
9   15.991590  14.023848    1.967741     1.967741
Notice that the two difference columns are the same, but the second version took about 8 seconds less time. (Presumably 800 seconds when you have 100x more data.)
I did the same thing in the 'nanify' process:
df.difference2[df.difference2 < -2] = np.nan
The idea here is that comparison operators on a Series produce a boolean Series (a vector of True/False values). That boolean Series can be used as an index, so df.difference2 < -2 becomes (in essence) the set of positions where the condition is true, and you can then index either df (the whole table) or any column of df, like df.difference2, with it. It's a fast shorthand for the otherwise-slow Python for loop.
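A tiny, self-contained illustration of that boolean-mask idea (the numbers are made up and independent of the speed data):
import numpy as np
import pandas as pd

s = pd.Series([1.0, -3.0, 0.5, -4.2, 2.0])
mask = s < -2        # boolean Series: [False, True, False, True, False]
print(s[mask])       # only the elements where the condition holds
s[mask] = np.nan     # vectorized assignment, no explicit Python loop
print(s)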
Update
Okay, finally, here is a version that vectorizes the "Time-inefficient Loop". I'm just pasting the whole thing in at the bottom, for copying.
The premise is that the Series.isnull() method returns a boolean Series (column) that is true if the contents are "missing" or "invalid" or "bogus." Generally, this means NaN, but it also recognizes Python None, etc.
The tricky part, in pandas, is shifting that column up or down by one to reflect "around"-ness.
That is, I want another boolean column, where col[n-1] is true if col[n] is null. That's my "before a nan" column. And likewise, I want another column where col[n+1] is true if col[n] is null. That's my "after a nan" column.
It turns out I had to take the damn thing apart! I had to reach in and extract the underlying numpy array using the Series.values attribute, so that the pandas index would be discarded. Then a new index is created, starting at 0, and everything works again. (If you don't strip the index, the columns "remember" what their numbers are supposed to be. So even if you delete column[0], the column doesn't shift down. Instead, it knows "I am missing my [0] value, but everyone else is still in the right place!")
Anyway, with that figured out, I was able to build three columns (needlessly - they could probably be parts of an expression) and then merge them together into a fourth column that indicates what you want: the column is True when the row is before, on, or after a nan value.
missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
Here's the whole thing:
import numpy as np
import pandas as pd
import time

current_milli_time = lambda: int(round(time.time() * 1000))

def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))

# sample data
speed = np.random.uniform(0, 25, 150000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
             'next_speed': next_speed}
df = pd.DataFrame(data_dict)
mark("Got DataFrame")

# calculate difference between the current speed and the next speed
list_of_differences = []
#for i in df.index:
#    difference = df.next_speed[i] - df.speed[i]
#    list_of_differences.append(difference)
#df['difference'] = list_of_differences
#mark("difference 1")
df['difference'] = df['next_speed'] - df['speed']
mark('difference 2')
df['difference2'] = df['next_speed'] - df['speed']

# add 'nan' to data in form of a string.
#for i in range(len(df.difference)):
#    # arbitrary condition
#    if df.difference[i] < -2:
#        df.difference[i] = 'nan'
df.difference[df.difference < -2] = np.nan
mark('nanify')
df.difference2[df.difference2 < -2] = np.nan
mark('nanify 2')

missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
mark('looped')

#########################################
# THE TIME-INEFFICIENT LOOP
# remove wrong values before and after 'nan'.
for i in range(len(df)):
    # check if the value is a number to skip computations of the following "if" cases
    if not isinstance(df.difference[i], str):
        continue
    # case 1: where there's only one 'nan' surrounded by values.
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1] = 'wrong'
        df.difference[i+1] = 'wrong'
    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1] = 'wrong'
    # case 3: where next value is NOT 'nan': wrong, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1] = 'wrong'
mark('time-inefficient loop done')
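If the goal is actually to drop those rows rather than just flag them, a hedged follow-up (assuming "remove" means discarding the nan rows together with their neighbours) is simply:
df_clean = df[~df['around_nan']]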
I am assuming that you don't want either the 'nan' or the wrong values, and that nan values are few compared to the size of the data. Please try this:
nan_idx = df[df['difference'] == 'nan'].index.tolist()
from copy import deepcopy
drop_list = deepcopy(nan_idx)
for i in nan_idx:
    # also mark the neighbours of every 'nan' row for removal
    if (i+1) not in drop_list and (i+1) < len(df):
        drop_list.append(i+1)
    if (i-1) not in drop_list and (i-1) >= 0:
        drop_list.append(i-1)
df = df.drop(df.index[drop_list])
If nan is not a string but an actual NaN (the marker for missing values), then use this to get its indexes:
nan_idx = df[pd.isnull(df['difference'])].index.tolist()
