I have 2 columns in my dataframe, one called 'Subreddits' which lists string values, and one called 'Appearances' which lists how many times they appear.
I am trying to add 1 to the value of a certain line in the 'Appearances' column when it detects a string value that is already in the dataframe.
df = pd.read_csv(Location)
print(len(elem))
while counter < 50:
    #gets just the subreddit name
    e = str(elem[counter].get_attribute("href"))
    e = e.replace("https://www.reddit.com/r/", "")
    e = e[:-1]
    inDf = None
    if (any(df.Subreddit == e)):
        print("Y")
        inDf = True
    if inDf:
        #adds 1 to the value of Appearances
        #df.set_value(e, 'Appearances', 2, takeable=False)
        #df.at[e, 'Appearances'] +=1
    else:
        #adds new row with the subreddit name and sets the amount of appearances to 1.
        df = df.append({'Subreddit': e, 'Appearances': 1}, ignore_index=True)
    print(e)
    counter = counter + 2
print(df)
The only part that is giving me trouble is the if inDF section. I cannot figure out how to add 1 to the 'Appearances' of the subreddit.
Your logic is a bit messy here. You don't need three references to inDF, you don't need to initialise it with None, and you don't need the built-in any with a pd.Series object.
You can check whether the value exists in a series via the in operator:
if e in df['Subreddit'].values:
    df.loc[df['Subreddit'] == e, 'Appearances'] += 1
else:
    df = df.append({'Subreddit': e, 'Appearances': 1}, ignore_index=True)
Even better, use a defaultdict in your loop and create your dataframe at the very end of the process. Your current use of pd.DataFrame.append is not recommended, because the expensive operation is repeated for each row. (DataFrame.append was also removed in pandas 2.0; pd.concat is its replacement.)
from collections import defaultdict
#initialise dictionary
dd = defaultdict(int)
while counter < 50:
    e = ... # gets just the subreddit name
    dd[e] += 1 # increment count by 1
    counter = counter + 2 # increment while loop counter
# create results dataframe
df = pd.DataFrame.from_dict(dd, orient='index').reset_index()
# rename columns
df.columns = ['Subreddit', 'Appearances']
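As a self-contained sketch of the approach above: it assumes the scraped names have already been collected into a plain list (the list `names` is made-up sample data, not taken from the question) and builds the dataframe once, after counting.

```python
from collections import defaultdict

import pandas as pd

# hypothetical sample data standing in for the scraped subreddit names
names = ["python", "pandas", "python", "learnpython", "pandas", "python"]

# count every name in a plain dictionary first
dd = defaultdict(int)
for name in names:
    dd[name] += 1

# build the dataframe once, after the loop has finished
df = pd.DataFrame(list(dd.items()), columns=["Subreddit", "Appearances"])
print(df)
```

Building the frame from `dd.items()` avoids the separate column-renaming step; either route should give the same two columns.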
You can use df.loc[df['Subreddits'] == e, 'Appearances'] += 1
example:
df = pd.DataFrame(columns=['Subreddits', 'Appearances'])
e_list = ['a', 'b', 'a', 'a', 'b', 'c']
for e in e_list:
    inDF = (df['Subreddits'] == e).sum() > 0
    if inDF:
        df.loc[df['Subreddits'] == e, 'Appearances'] += 1
    else:
        df = df.append([{'Subreddits': e, 'Appearances': 1}])
df.reset_index(inplace=True, drop=True) # good idea to reset the index..
print(df)
  Subreddits Appearances
0          a           3
1          b           2
2          c           1
Related
I am trying to create a list of 5 letter words in a pandas dataframe, which splits the word into different columns and assigns a value to each letter, then performs a summation of the values. The following code imports a .json dictionary and assigns values to a letter:
import json
import pandas as pd
def split(word):
    return [char for char in word]
j = open('words_dictionary.json')
dictionary = json.load(j)
dictionary_db = pd.DataFrame(columns=['Word'])
letter_db = pd.DataFrame(columns=['Letter'])
word_count = 0
num_word_count = 0
letter_count = 0
for i in dictionary:
    word_count += 1
    if len(i) == 5:
        dictionary1_db = pd.DataFrame({i}, columns=['Word'])
        dictionary_db = pd.concat([dictionary1_db, dictionary_db], ignore_index=True, axis=0)
        num_word_count += 1
        split_word = split(i)
        L1 = split_word[0]
        L2 = split_word[1]
        L3 = split_word[2]
        L4 = split_word[3]
        L5 = split_word[4]
        for s in split_word:
            letter_count += 1
            letter1_db = pd.DataFrame({s}, columns=['Letter'])
            letter_db = pd.concat([letter_db, letter1_db], ignore_index=True, axis=0)
grouped = letter_db.groupby('Letter').value_counts()
grouped_db = pd.DataFrame(grouped, columns=['Value'])
grouped_db = grouped_db.apply(lambda x: (x/num_word_count)*.2, axis=1)
grouped_dict = grouped_db.to_dict()
Resulting in a grouped_db of:
   Letter    Value
0       a  0.10544
1       b  0.02625
2       c  0.03448
..     ..       ..
as well as a similar dictionary:
grouped_dict = {'a': 0.10544, 'b': 0.02625, 'c': 0.03448, ...}
My problems start to occur when I try to map the string value to the float value.
How would I go about either merging a float value to the specified letter key or mapping a dictionary float value to the specified letter key without causing an 'NaN' value error?
ATTEMPT 1:
df = pd.DataFrame([{'L1': 'a'}])
df['L1Val'] = df['L1'].map(grouped_dict)
df:
  L1 L1Val
0  a   nan
intended df output:
  L1   L1Val
0  a  .10544
You can merge the two dataframes as follows:
df.merge(grouped_db, left_on="L1", right_on="Letter")
The columns L1 and Letter will be redundant but you can filter one out afterwards.
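A minimal sketch of both routes, assuming grouped_db has a 'Letter' column and a 'Value' column as in the table shown in the question (the three values are copied from there; the two-row df is made up for illustration):

```python
import pandas as pd

# assumed shape of grouped_db: one row per letter, values from the question
grouped_db = pd.DataFrame({"Letter": ["a", "b", "c"],
                           "Value": [0.10544, 0.02625, 0.03448]})
df = pd.DataFrame({"L1": ["a", "c"]})

# option 1: merge, then drop the redundant key column and rename the value
merged = (df.merge(grouped_db, left_on="L1", right_on="Letter", how="left")
            .drop(columns="Letter")
            .rename(columns={"Value": "L1Val"}))

# option 2: map through a flat letter -> value lookup
lookup = grouped_db.set_index("Letter")["Value"]
df["L1Val"] = df["L1"].map(lookup)

print(merged)
print(df)
```

One possible cause of the NaN in the question, which the source does not confirm, is that DataFrame.to_dict() returns a nested {column: {index: value}} dictionary by default, so the letters are not its top-level keys and map finds no match.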
I made a surname dict containing surnames like this:
--The file contains 200 000 words, and this is a sample of the surname_dict--
['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN']
I am not allowed to use the Counter library or numpy, just native Python.
My idea was to use a for-loop to sort through the dictionary, but I have hit some walls. Please help with some advice.
Thanks.
surname_dict = []
count = 0
for index in data_list:
    if index["lastname"] not in surname_dict:
        count = count + 1
        surname_dict.append(index["lastname"])
for k, v in sorted(surname_dict.items(), key=lambda item: item[1]):
    if count < 10: # Print only the top 10 surnames
        print(k)
        count += 1
    else:
        break
As mentioned in a comment, your dict is actually a list.
Try using the Counter object from the collections library. In the below example, I have edited your list so that it contains a few duplicates.
from collections import Counter
surnames = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN', 'OLDERVIK', 'ØSTBY', 'ØSTBY']
counter = Counter(surnames)
for name in counter.most_common(3):
    print(name)
The result becomes:
('ØSTBY', 3)
('OLDERVIK', 2)
('KRISTIANSEN', 1)
Change the integer argument to most_common to 10 for your use case.
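Since the question rules out Counter, the same ranking can be sketched with a plain dictionary and sorted. This is only a minimal sketch: the helper name top_n is made up for illustration, and the list is a shortened version of the sample above.

```python
def top_n(names, n=10):
    """Return the n most frequent names as (name, count) pairs."""
    counts = {}
    for name in names:
        # get() returns 0 the first time a name is seen
        counts[name] = counts.get(name, 0) + 1
    # sort by count, largest first
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


surnames = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'ØSTBY', 'OLDERVIK', 'ØSTBY', 'ØSTBY']
print(top_n(surnames, 3))
```

Because sorted is stable, names with equal counts stay in the order in which they were first seen.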
The best approach to answer your question is to consider the top ten categories, where a category is the group of names that share the same count: for example, the names that are used 9 times, the names that are used 200 times, and so on. This matters because many names can share one count. We could have a case where 100 users all have different names, yet all of those names belong in the top 10. Here is a script that implements this approach:
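def counter(file: list):
    # count how many times each name occurs
    counts = {}
    for name in file:
        counts[name] = counts.get(name, 0) + 1
    # group the names by count: {count: [names with that count]}
    M = {}
    for name, n in counts.items():
        M.setdefault(n, []).append(name)
    # keep the ten highest counts (all of them if there are fewer than ten)
    D = sorted(M.keys(), reverse=True)
    return {n: M[n] for n in D[:10]}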
Note: my script calculates the top ten categories.
You could place all the values in a dictionary where the value is the number of times the name appears in the dataset, then filter through your newly created dictionary and push any name with a count >= 10 to your final list.
edit: your surname_dict was initialized as an array, not a dictionary.
surname_dict = {}
top_ten = []
for index in data_list:
    if index['lastname'] not in surname_dict.keys():
        surname_dict[index['lastname']] = 1
    else:
        surname_dict[index['lastname']] += 1
for k, v in sorted(surname_dict.items()):
    if v >= 10:
        top_ten.append(k)
print(top_ten)
Just use a standard dictionary. I've added some duplicates to your data, and am using a threshold value to grab any names with at least 2 occurrences. Use threshold = 10 for your actual code.
names = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY','ØSTBY','ØSTBY','REMLO', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN']
# you need 10 in your code, but I've only added a few dups to your sample data
threshold = 2
di = {}
for name in names:
    #grab name count, initialize to zero first time
    count = di.get(name, 0)
    di[name] = count + 1
#basic filtering, no sorting
unsorted = {name:count for name, count in di.items() if count >= threshold}
print(f"{unsorted=}")
#sorting by frequency: filter out the ones you don't want
bigenough = [(count, name) for name, count in di.items() if count >= threshold]
tops = sorted(bigenough, reverse=True)
print(f"{tops=}")
#or as another dict
tops_dict = {name:count for count, name in tops}
print(f"{tops_dict=}")
Output:
unsorted={'ØSTBY': 3, 'REMLO': 2}
tops=[(3, 'ØSTBY'), (2, 'REMLO')]
tops_dict={'ØSTBY': 3, 'REMLO': 2}
Update.
Wanted to share what code I made in the end. Thank you guys so much. The feedback really helped.
Code:
etternavn_dict = {}
for index in data_list:
    if index['etternavn'] not in etternavn_dict.keys():
        etternavn_dict[index['etternavn']] = 1
    else:
        etternavn_dict[index['etternavn']] += 1
print("\nTopp 10 etternavn:")
count = 0
# sort by count, highest first, so the most common surnames are printed
for k, v in sorted(etternavn_dict.items(), key=lambda item: item[1], reverse=True):
    if count < 10:
        print(k)
        count += 1
    else:
        break
When iterating through the elements of a column (Y, Y, nan, Y in my case), for some reason I can't set a new value when a condition is met. When Y is encountered twice in a row, I want to replace the last Y with "encountered" (or simply overwrite it), since I keep track of the index number.
I have a dataframe
  col0 col1
1    A    Y
2    B    Y
3    B  nan
4    C    Y
code:
count = 0
for i,e in enumerate(df[col1]):
    if 'Y' in e:
        count += 1
    else:
        count = 0
    if count == 2:
        df['col1'][i] = 'encountered' # IndexError: list index out of range
error message:
IndexError: list index out of range
Even if I try to specify the index in which column-cell I would like to 'add the msg to' gives me the same error:
code;
df['col1'][1] = 'or this'
main idea direct example:
df['col1'][2] = 'under index 2 in column1 add this msg'
Is it because PyPDF2/utils is interfering?
warning:
File "C:\Users\path\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
error:
IndexError: list index out of range
last_index=df[df['col1']=='Y'].index[-1]
df.loc[last_index,'col1']='encountered'
Here's how I would go about solving this:
prev_val = None
# Iterate through rows to utilize the index
for idx, row in df[['col1']].iterrows():
    # unpack your row, a bit more overhead but highly readable
    val = row['col1']
    # Use previous value instead of counter; it is easier to read and more accurate
    if val == 'Y' and val == prev_val:
        df.loc[idx, 'col1'] = 'encountered'
    # now set the prev value to current:
    prev_val = val
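The same rule (mark every 'Y' that directly follows another 'Y') can also be sketched without an explicit loop by comparing the column with a shifted copy of itself. The frame below is rebuilt from the question's sample, assuming the default 0-based index rather than the 1-based one shown there.

```python
import numpy as np
import pandas as pd

# sample data rebuilt from the question (default index is an assumption)
df = pd.DataFrame({"col0": ["A", "B", "B", "C"],
                   "col1": ["Y", "Y", np.nan, "Y"]})

is_y = df["col1"] == "Y"
# True where this row and the row directly above are both 'Y'
repeated = is_y & is_y.shift(fill_value=False)
df.loc[repeated, "col1"] = "encountered"
print(df)
```

The mask is computed before any value is overwritten, so a run of three or more 'Y' values is handled the same way as in the loop version.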
The issue with your code may be the way you iterate over your dataframe, and also the indexing. Another issue is that you are overwriting the values you are iterating over. That is bound to give you issues later.
Does this work for you?
>>> count = 0
>>> df['encounter'] = np.nan
>>> for i in df.itertuples():
...     if getattr(i, 'col1')=='Y':
...         count+=1
...     else:
...         count = 0
...     if count==2:
...         df.loc[i[0], 'encounter']= 'encountered'
...
>>> print(df)
  col0 col1    encounter
0    A    Y          NaN
1    B    Y  encountered
2    B  NaN          NaN
3    C    Y          NaN
I'm trying to make a data visualization app that takes a CSV file and lets the user select which columns to plot (not all columns are plotted). I already have the function that selects only a few variables, but now I need to join those columns into a single data frame to work with. I tried to do this:
for i in range(0, len(data1.columns)):
    i = 0
    df = np.array(data1[data1.columns[i]])
    i +=1
    print(df)
But I only get the same column repeated once per selected column (i.e. if I select 5 columns, the same column is returned 5 times).
How do I ensure that for each iteration I insert a different column and not always the same one?
The same column repeats because i is overwritten inside the loop.
# For example `data1.columns` is ["a", "b", "c", "d", "e"]
# Your code:
for i in range(0, len(data1.columns)):
    i = 0  # Here, in every iteration, i is reset to 0
    print(i, data1.columns[i], sep=": ")
    i += 1
# Output:
# 0: a
# 0: a
# 0: a
# 0: a
# 0: a
The lines i = 0 and i += 1 are unnecessary, because range already supplies i, running from 0 to len(data1.columns) - 1.
Fixed version
for i in range(0, len(data1.columns)):
    print(i, data1.columns[i], sep=": ")
# Output:
# 0: a
# 1: b
# 2: c
# 3: d
# 4: e
Versions using manual increment i plus iteration through elements:
# First step, iter over columns
for col in data1.columns:
    print(col)
# Output:
# a
# b
# c
# d
# e
# Step two, manual increment to obtain the list (array) index
i = 0
for col in data1.columns:
    print(i, col, sep=": ")
    i += 1
# Output:
# 0: a
# 1: b
# 2: c
# 3: d
# 4: e
Helpful to know, enumerate:
The enumerate(iterable) function is a convenient way to get both the index and the value itself.
print(list(enumerate(["Hello", "world"])))
# Output:
# [(0, 'Hello'), (1, 'world')]
Usage:
for i, col in enumerate(data1.columns):
    print(i, col, sep=": ")
# Output:
# 0: a
# 1: b
# 2: c
# 3: d
# 4: e
At the end I solved it, declaring an empty list before the loop, iterating on the selected variables and saving the indexes in this list. So I get a list with the indexes that I should use for my visualization.
def get_index(name):
    '''
    return the index of a column name
    '''
    for column in df.columns:
        if column == name:
            index = df.columns.get_loc(column)
            return index
result=[]
for i in range(len(selected)):
    X = get_index(selected[i])
    result.append(X)
df = df[df.columns[result]]
x = df.values
Where 'selected' is the list of selected variables (filter first by column name, then get its index number), I don't know if it's the most elegant way to do this, but it works well.
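For comparison, here is a shorter sketch of the same idea. It assumes every name in `selected` is a unique column of the frame; the data and the selection are made up for illustration.

```python
import pandas as pd

# hypothetical data; `selected` stands in for the user's chosen column names
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
selected = ["a", "c"]

# positions of the selected columns, if the indexes themselves are needed
result = [df.columns.get_loc(name) for name in selected]

# selecting by name directly gives the same sub-frame
subset = df[selected]
x = subset.values

print(result)
print(x)
```

If only the values are needed for plotting, `df[selected]` alone may be enough and the index list can be skipped.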
I am making a Python script that parses an Excel file using the xlrd library.
What I would like is to do calculations on different columns if the cells contain a certain value. Otherwise, skip those values. Then store the output in a dictionary.
Here's what I tried to do :
import xlrd
workbook = xlrd.open_workbook('filter_data.xlsx')
worksheet = workbook.sheet_by_name('filter_data')
num_rows = worksheet.nrows -1
num_cells = worksheet.ncols - 1
first_col = 0
scnd_col = 1
third_col = 2
# Read Data into double level dictionary
celldict = dict()
for curr_row in range(num_rows) :
    cell0_val = int(worksheet.cell_value(curr_row+1,first_col))
    cell1_val = worksheet.cell_value(curr_row,scnd_col)
    cell2_val = worksheet.cell_value(curr_row,third_col)
    if cell1_val[:3] == 'BL1' :
        if cell2_val=='toSkip' :
            continue
    elif cell1_val[:3] == 'OUT' :
        if cell2_val == 'toSkip' :
            continue
    if not cell0_val in celldict :
        celldict[cell0_val] = dict()
    # if the entry isn't in the second level dictionary then add it, with count 1
    if not cell1_val in celldict[cell0_val] :
        celldict[cell0_val][cell1_val] = 1
    # Otherwise increase the count
    else :
        celldict[cell0_val][cell1_val] += 1
So here as you can see, I count the number of "cell1_val" values for each "cell0_val". But I would like to skip those values which have "toSkip" in the adjacent column's cell before doing the sum and storing it in the dict.
I am doing something wrong here, and I feel like the solution is much more simple.
Any help would be appreciated. Thanks.
Here's an example of my sheet :
cell0    cell1    cell2
12       BL1      toSkip
12       BL1      doNotSkip
12       OUT3     doNotSkip
12       OUT3     toSkip
13       BL1      doNotSkip
13       BL1      toSkip
13       OUT3     doNotSkip
Use collections.defaultdict with collections.Counter for your nested dictionary.
Here it is in action:
>>> import pprint
>>> from collections import defaultdict, Counter
>>> d = defaultdict(Counter)
>>> d['red']['blue'] += 1
>>> d['green']['brown'] += 1
>>> d['red']['blue'] += 1
>>> pprint.pprint(d)
{'green': Counter({'brown': 1}),
'red': Counter({'blue': 2})}
Here it is integrated into your code:
from collections import defaultdict, Counter
import xlrd
workbook = xlrd.open_workbook('filter_data.xlsx')
worksheet = workbook.sheet_by_name('filter_data')
first_col = 0
scnd_col = 1
third_col = 2
celldict = defaultdict(Counter)
for curr_row in range(1, worksheet.nrows): # start at 1 skips header row
    cell0_val = int(worksheet.cell_value(curr_row, first_col))
    cell1_val = worksheet.cell_value(curr_row, scnd_col)
    cell2_val = worksheet.cell_value(curr_row, third_col)
    if cell2_val == 'toSkip' and cell1_val[:3] in ('BL1', 'OUT'):
        continue
    celldict[cell0_val][cell1_val] += 1
I also combined your if-statements and changed the calculation of curr_row to be simpler.
It appears you want to skip the current line whenever cell2_val equals 'toSkip', so it would simplify the code if you add if cell2_val=='toSkip' : continue directly after computing cell2_val.
Also, where you have
# if the entry isn't in the second level dictionary then add it, with count 1
if not cell1_val in celldict[cell0_val] :
    celldict[cell0_val][cell1_val] = 1
# Otherwise increase the count
else :
    celldict[cell0_val][cell1_val] += 1
the usual idiom is more like
celldict[cell0_val][cell1_val] = celldict[cell0_val].get(cell1_val, 0) + 1
That is, use a default value of 0 so that if key cell1_val is not yet in celldict[cell0_val], then get() will return 0.
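Here is a small sketch of that idiom together with the early skip. It runs on the sample rows from the question as a plain list of tuples, so xlrd is not needed; using setdefault for the outer dictionary is an addition for illustration and not part of the original code.

```python
# rows copied from the sample sheet in the question: (cell0, cell1, cell2)
rows = [
    (12, 'BL1', 'toSkip'),
    (12, 'BL1', 'doNotSkip'),
    (12, 'OUT3', 'doNotSkip'),
    (12, 'OUT3', 'toSkip'),
    (13, 'BL1', 'doNotSkip'),
    (13, 'BL1', 'toSkip'),
    (13, 'OUT3', 'doNotSkip'),
]

celldict = {}
for cell0_val, cell1_val, cell2_val in rows:
    # skip the row before any counting happens
    if cell2_val == 'toSkip':
        continue
    # create the inner dictionary the first time cell0_val is seen
    inner = celldict.setdefault(cell0_val, {})
    # get() returns 0 when cell1_val has not been counted yet
    inner[cell1_val] = inner.get(cell1_val, 0) + 1

print(celldict)
```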