Skip Excel lines with Python

I am making a Python script that parses an Excel file using the xlrd library.
What I would like is to do calculations on different columns if the cells contain a certain value, skip the rows that don't, and then store the output in a dictionary.
Here's what I tried to do:
import xlrd

workbook = xlrd.open_workbook('filter_data.xlsx')
worksheet = workbook.sheet_by_name('filter_data')
num_rows = worksheet.nrows - 1
num_cells = worksheet.ncols - 1
first_col = 0
scnd_col = 1
third_col = 2

# Read data into a two-level dictionary
celldict = dict()
for curr_row in range(num_rows):
    cell0_val = int(worksheet.cell_value(curr_row + 1, first_col))
    cell1_val = worksheet.cell_value(curr_row, scnd_col)
    cell2_val = worksheet.cell_value(curr_row, third_col)
    if cell1_val[:3] == 'BL1':
        if cell2_val == 'toSkip':
            continue
    elif cell1_val[:3] == 'OUT':
        if cell2_val == 'toSkip':
            continue
    if not cell0_val in celldict:
        celldict[cell0_val] = dict()
    # if the entry isn't in the second-level dictionary then add it, with count 1
    if not cell1_val in celldict[cell0_val]:
        celldict[cell0_val][cell1_val] = 1
    # otherwise increase the count
    else:
        celldict[cell0_val][cell1_val] += 1
So as you can see, I count the number of "cell1_val" values for each "cell0_val", but I would like to skip the rows that have "toSkip" in the adjacent column's cell before doing the sum and storing it in the dict.
I am doing something wrong here, and I feel like the solution is much simpler.
Any help would be appreciated. Thanks.
Here's an example of my sheet :
cell0 cell1 cell2
12 BL1 toSkip
12 BL1 doNotSkip
12 OUT3 doNotSkip
12 OUT3 toSkip
13 BL1 doNotSkip
13 BL1 toSkip
13 OUT3 doNotSkip

Use collections.defaultdict with collections.Counter for your nested dictionary.
Here it is in action:
>>> import pprint
>>> from collections import defaultdict, Counter
>>> d = defaultdict(Counter)
>>> d['red']['blue'] += 1
>>> d['green']['brown'] += 1
>>> d['red']['blue'] += 1
>>> pprint.pprint(d)
{'green': Counter({'brown': 1}),
 'red': Counter({'blue': 2})}
Here it is integrated into your code:
from collections import defaultdict, Counter
import xlrd

workbook = xlrd.open_workbook('filter_data.xlsx')
worksheet = workbook.sheet_by_name('filter_data')

first_col = 0
scnd_col = 1
third_col = 2

celldict = defaultdict(Counter)
for curr_row in range(1, worksheet.nrows):  # starting at 1 skips the header row
    cell0_val = int(worksheet.cell_value(curr_row, first_col))
    cell1_val = worksheet.cell_value(curr_row, scnd_col)
    cell2_val = worksheet.cell_value(curr_row, third_col)
    if cell2_val == 'toSkip' and cell1_val[:3] in ('BL1', 'OUT'):
        continue
    celldict[cell0_val][cell1_val] += 1
I also combined your if-statements and simplified the calculation of curr_row.

It appears you want to skip the current line whenever cell2_val equals 'toSkip', so it would simplify the code if you add if cell2_val == 'toSkip': continue directly after computing cell2_val.
Also, where you have
# if the entry isn't in the second level dictionary then add it, with count 1
if not cell1_val in celldict[cell0_val]:
    celldict[cell0_val][cell1_val] = 1
# Otherwise increase the count
else:
    celldict[cell0_val][cell1_val] += 1
the usual idiom is more like
celldict[cell0_val][cell1_val] = celldict[cell0_val].get(cell1_val, 0) + 1
That is, use a default value of 0 so that if key cell1_val is not yet in celldict[cell0_val], then get() will return 0.
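Putting both suggestions together, the loop body could look roughly like this (a minimal sketch, assuming the same sheet layout and header row as in the question):

import xlrd

workbook = xlrd.open_workbook('filter_data.xlsx')
worksheet = workbook.sheet_by_name('filter_data')

celldict = dict()
for curr_row in range(1, worksheet.nrows):  # skip the header row
    cell0_val = int(worksheet.cell_value(curr_row, 0))
    cell1_val = worksheet.cell_value(curr_row, 1)
    cell2_val = worksheet.cell_value(curr_row, 2)
    if cell2_val == 'toSkip':
        continue  # drop the row before it is counted
    inner = celldict.setdefault(cell0_val, dict())
    inner[cell1_val] = inner.get(cell1_val, 0) + 1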

Related

Need to find the top 10 most used surnames in a file. Made a dictionary but need help sorting the rest

I made a surname dict containing surnames like this:
(The file contains 200,000 words; this is a sample of surname_dict.)
['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN']
I am not allowed to use the Counter library or NumPy, just native Python.
My idea was to use a for loop to sort through the dictionary, but I hit some walls. Please help with some advice.
Thanks.
surname_dict = []
count = 0
for index in data_list:
    if index["lastname"] not in surname_dict:
        count = count + 1
        surname_dict.append(index["lastname"])

for k, v in sorted(surname_dict.items(), key=lambda item: item[1]):
    if count < 10:  # Print only the top 10 surnames
        print(k)
        count += 1
    else:
        break
As mentioned in a comment, your dict is actually a list.
Try using the Counter object from the collections library. In the below example, I have edited your list so that it contains a few duplicates.
from collections import Counter

surnames = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN', 'OLDERVIK', 'ØSTBY', 'ØSTBY']
counter = Counter(surnames)
for name in counter.most_common(3):
    print(name)
The result becomes:
('ØSTBY', 3)
('OLDERVIK', 2)
('KRISTIANSEN', 1)
Change the integer argument to most_common to 10 for your use case.
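Since the question says Counter is not allowed, the same counting and ranking can be done with a plain dict and sorted(); a minimal sketch (reusing a shortened sample list with a few duplicates):

surnames = ['KRISTIANSEN', 'OLDERVIK', 'ØSTBY', 'OLDERVIK', 'ØSTBY', 'ØSTBY']
counts = {}
for name in surnames:
    counts[name] = counts.get(name, 0) + 1
# sort by count, highest first, and keep at most the first 10
for name, count in sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(name, count)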
Another approach is to think in terms of the top ten count categories: for example, the category of names used 9 times versus the category of names used 200 times. Several different names can share the same count and still all belong among the top 10 most-used names. Here is a script implementing that approach:
def counter(file: list):
    L = set(file)
    i = 0
    M = {}
    for j in L:
        for k in file:
            if j == k:
                i += 1
        M.update({i: j})
        i = 0
    D = list(M.keys())
    D.sort(reverse=True)  # highest counts first
    F = {}
    if len(D) >= 10:
        K = D[0:10]
        for i in K:
            F.update({i: M[i]})
        return F
    else:
        return M
Note: my script calculates the top ten count categories.
You could place all the names in a dictionary where the value is the number of times each one appears in the dataset, then filter your newly created dictionary and push any name whose count is greater than 10 onto your final array.
Edit: your surname_dict was initialized as an array, not a dictionary.
surname_dict = {}
top_ten = []
for index in data_list:
    if index['lastname'] not in surname_dict.keys():
        surname_dict[index['lastname']] = 1
    else:
        surname_dict[index['lastname']] += 1

for k, v in sorted(surname_dict.items()):
    if v >= 10:
        top_ten.append(k)

print(top_ten)  # was: return top_ten (only valid inside a function)
Just use a standard dictionary. I've added some duplicates to your data and am using a threshold value to grab any names with more than 2 occurrences. Use threshold = 10 for your actual code.
names = ['KRISTIANSEN', 'OLDERVIK', 'GJERSTAD', 'VESTLY SKIVIK', 'NYMANN', 'ØSTBY', 'ØSTBY', 'ØSTBY', 'REMLO', 'LINNERUD', 'REMLO', 'SKARSHAUG', 'ELI', 'ADOLFSEN']

# you need 10 in your code, but I've only added a few dups to your sample data
threshold = 2

di = {}
for name in names:
    # grab name count, initialize to zero first time
    count = di.get(name, 0)
    di[name] = count + 1

# basic filtering, no sorting
unsorted = {name: count for name, count in di.items() if count >= threshold}
print(f"{unsorted=}")

# sorting by frequency: filter out the ones you don't want
bigenough = [(count, name) for name, count in di.items() if count >= threshold]
tops = sorted(bigenough, reverse=True)
print(f"{tops=}")

# or as another dict
tops_dict = {name: count for count, name in tops}
print(f"{tops_dict=}")
Output:
unsorted={'ØSTBY': 3, 'REMLO': 2}
tops=[(3, 'ØSTBY'), (2, 'REMLO')]
tops_dict={'ØSTBY': 3, 'REMLO': 2}
Update.
Wanted to share what code I made in the end. Thank you guys so much. The feedback really helped.
Code:
etternavn_dict = {}
for index in data_list:
    if index['etternavn'] not in etternavn_dict.keys():
        etternavn_dict[index['etternavn']] = 1
    else:
        etternavn_dict[index['etternavn']] += 1

print("\nTopp 10 etternavn:")
count = 0
# sort by count, highest first, so the most common names print first
for k, v in sorted(etternavn_dict.items(), key=lambda item: item[1], reverse=True):
    if count < 10:
        print(k)
        count += 1
    else:
        break

How to dump contents of an array to a pre-existing csv with hardcoded data in python

I have posted this question earlier.
At the output of this program, I get an array whose rows each have 4 elements, like this:
11111111,22222222,kkkkkk,lllllll
33333333,44444444,oooooo,ppppppp
qqqqqqqq,rrrrrr,ssssss,ttttttt
Now I have another csv which has more columns (let's say 10), and some of those columns have hardcoded data, something like this:
head1,head2,head3,head4,head5,head6,head7,head8,head9,head10
-,123,345,something,<blank>,<blank>,-,-,-,-
so everything except the blank columns is hardcoded.
I want to print the first and second columns of my output in these blank spaces and repeat the hardcoded data on every row.
So my output should be something like this:
head1,head2,head3,head4,head5,head6,head7,head8,head9,head10
-,123,345,something,11111111,22222222,-,-,-,-
-,123,345,something,33333333,44444444,-,-,-,-
-,123,345,something,qqqqqqqq,rrrrrr,-,-,-,-
Approach:
1) Read lines from the done.csv and append them to separate lists.
2) Read the new csv with the empty column data. Let's call it missing_data.csv.
3) Iterate over the lists from 1), i.e. 3 in your case.
4) Iterate over each column of missing_data.csv until an empty value is found
5) Fill the empty column with the list currently running from 3)
Hence:
1):
import pandas as pd

initial_data1 = []
initial_data2 = []
initial_data3 = []
line_num = 1
with open("list.txt") as f:
    content = f.readlines()
    for line in content:
        if line_num == 1:
            initial_data1.append(line.split(","))
        elif line_num == 2:
            initial_data2.append(line.split(","))
        elif line_num == 3:
            initial_data3.append(line.split(","))
        line_num = line_num + 1

print(initial_data1)
print(initial_data2)
print(initial_data3)
OUTPUT:
[['11111111', '22222222', 'kkkkkk', 'lllllll\n']]
[['33333333', '44444444', 'oooooo', 'ppppppp\n']]
[['qqqqqqqq', 'rrrrrr', 'ssssss', 'ttttttt']]
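As an aside, the same rows can be read more compactly into one list of lists (a sketch, assuming list.txt holds exactly the three rows shown above):

with open("list.txt") as f:
    initial_data = [line.strip().split(",") for line in f if line.strip()]
# initial_data[0], initial_data[1], initial_data[2] correspond to
# initial_data1/2/3 above, without the extra nesting or trailing newline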
The rest:
df = pd.read_csv("missing_data.csv")
heads = ['head1', 'head2', 'head3', 'head4', 'head5', 'head6', 'head7', 'head8', 'head9', 'head10']
appending_line = 0
for index, row in df.iterrows():
    if appending_line == 0:
        initial_data = initial_data1
    elif appending_line == 1:
        initial_data = initial_data2
    elif appending_line == 2:
        initial_data = initial_data3
    j = 0
    k = 0
    appending_line += 1
    for i in range(0, len(heads)):  # for the number of heads
        if str(row[heads[i]]) == " ":
            print("FOUND EMPTY COLUMN: ", heads[i])
            print("APPENDING VALUE: ", initial_data[j][k])
            row[heads[i]] = initial_data[j][k]
            k += 1
OUTPUT:
FOUND EMPTY COLUMN VALUE: head5
APPENDING VALUE: 11111111
FOUND EMPTY COLUMN VALUE: head6
APPENDING VALUE: 22222222
FOUND EMPTY COLUMN VALUE: head5
APPENDING VALUE: 33333333
FOUND EMPTY COLUMN VALUE: head6
APPENDING VALUE: 44444444
FOUND EMPTY COLUMN VALUE: head5
APPENDING VALUE: qqqqqqqq
FOUND EMPTY COLUMN VALUE: head6
APPENDING VALUE: rrrrrr
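One caveat: assigning to row inside iterrows() modifies a copy, so the changes may not stick in df itself, and nothing is written back to disk. A hedged sketch that writes the values through df.at and saves the result (column names head5/head6 taken from the example; the output filename is an assumption):

sources = [initial_data1, initial_data2, initial_data3]
for index in range(len(sources)):
    # each initial_dataN is a list holding one row, hence the [0]
    df.at[index, 'head5'] = sources[index][0][0]
    df.at[index, 'head6'] = sources[index][0][1]
df.to_csv("missing_data_filled.csv", index=False)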

How to change one value in a Pandas DataFrame

I have 2 columns in my dataframe, one called 'Subreddits' which lists string values, and one called 'Appearances' which lists how many times they appear.
I am trying to add 1 to the value of a certain line in the 'Appearances' column when it detects a string value that is already in the dataframe.
df = pd.read_csv(Location)
print(len(elem))
while counter < 50:
    # gets just the subreddit name
    e = str(elem[counter].get_attribute("href"))
    e = e.replace("https://www.reddit.com/r/", "")
    e = e[:-1]
    inDf = None
    if (any(df.Subreddit == e)):
        print("Y")
        inDf = True
    if inDf:
        # adds 1 to the value of Appearances
        # df.set_value(e, 'Appearances', 2, takeable=False)
        # df.at[e, 'Appearances'] += 1
        pass  # this is the part that doesn't work
    else:
        # adds new row with the subreddit name and sets the amount of appearances to 1.
        df = df.append({'Subreddit': e, 'Appearances': 1}, ignore_index=True)
    print(e)
    counter = counter + 2
print(df)
The only part that is giving me trouble is the if inDF section. I cannot figure out how to add 1 to the 'Appearances' of the subreddit.
Your logic is a bit messy here: you don't need three references to inDf, you don't need to instantiate it with None, and you don't need the built-in any with a pd.Series object.
You can check whether the value exists in a series via the in operator:
if e in df['Subreddit'].values:
    df.loc[df['Subreddit'] == e, 'Appearances'] += 1
else:
    df = df.append({'Subreddit': e, 'Appearances': 1}, ignore_index=True)
Even better, use a defaultdict in your loop and create your dataframe at the very end of the process. Your current use of pd.DataFrame.append is not recommended as the expensive operation is being repeated for each row.
from collections import defaultdict

# initialise dictionary
dd = defaultdict(int)

while counter < 50:
    e = ...  # gets just the subreddit name
    dd[e] += 1  # increment count by 1
    counter = counter + 2  # increment while loop counter

# create results dataframe
df = pd.DataFrame.from_dict(dd, orient='index').reset_index()
# rename columns
df.columns = ['Subreddit', 'Appearances']
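If the final table should be ordered by count, a sort can be appended afterwards (a small sketch using the column names above):

df = df.sort_values('Appearances', ascending=False).reset_index(drop=True)
print(df)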
You can use df.loc[df['Subreddits'] == e, 'Appearances'] += 1
example:
df = pd.DataFrame(columns=['Subreddits', 'Appearances'])
e_list = ['a', 'b', 'a', 'a', 'b', 'c']
for e in e_list:
    inDF = (df['Subreddits'] == e).sum() > 0
    if inDF:
        df.loc[df['Subreddits'] == e, 'Appearances'] += 1
    else:
        df = df.append([{'Subreddits': e, 'Appearances': 1}])

df.reset_index(inplace=True, drop=True)  # good idea to reset the index..
print(df)

  Subreddits  Appearances
0          a            3
1          b            2
2          c            1

Optimize searching two text files and output based upon a third using Python

I'm having performance issues with a Python function that loads two 5+ GB tab-delimited text files, which share the same format but hold different values, and uses a third text file as a key to determine which values should be kept for output. I'd like some help with speed gains if possible.
Here is the code:
import csv

def rchfile():
    # there are 24752 text lines per stress period, 520 columns, 476 rows
    # there are 52 lines per MODFLOW model row
    lst = []
    out = []
    tcel = 0
    end_loop_break = False

    # key file that will set which file's values to use. If the cell address is not present
    # or the value of cellid = 1, use baseline.csv; otherwise use the test_p97 file.
    with open('input/nrd_cells.csv') as csvfile:
        reader = csv.reader(csvfile)
        for item in reader:
            lst.append([int(item[0]), int(item[1])])

    # two files that are used for data
    with open('input/test_baseline.rch', 'r') as b, open('input/test_p97.rch', 'r') as c:
        for x in range(3):  # skip the first 3 lines that are the file header
            b.readline()
            c.readline()
        while True:  # loop until end of file, this should loop here 1,025 times
            if end_loop_break == True: break
            for x in range(2):  # skip the first 2 lines that are the stress period header
                b.readline()
                c.readline()
            for rw in range(1, 477):
                if end_loop_break == True: break
                for cl in range(52):
                    # read both files at the same time to get the different data and split the 10 values in the row
                    b_row = b.readline().split()
                    c_row = c.readline().split()
                    if not b_row:
                        end_loop_break = True
                        break
                    for x in range(1, 11):
                        # search for the cell address in the key file to find which file's data to keep
                        testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]
                        if not testval:  # cell address not in key file
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 1:  # cell address value == 1
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 2:  # cell address value == 2
                            out.append(c_row[x - 1])
                    print(cl * 10 + x + tcel)  # test output for cell location
                tcel += 520
    print('success')
The key file looks like:
37794, 1
37795, 0
37796, 2
The data files are large ~5GB each and complex from a counting standpoint, but are standard in format and look like:
0 0 0 0 0 0 0 0 0 0
1.5 1.5 0 0 0 0 0 0 0 0
This process is taking a very long time and was hoping someone could help speed it up.
I believe your speed problem is coming from this line:
testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]
You are iterating over the whole key list for every single value in the HUGE output files. This is not good.
It looks like cl * 10 + x + tcel is the formula you are looking for in lst[n][0].
May I suggest you use a dict instead of a list for storing the data in lst.
lst = {}
for item in reader:
    lst[int(item[0])] = int(item[1])
Now, lst is a mapping, which means you can simply use the in operator to check for the presence of a key. This is a near instant lookup because the dict type is hash based and very efficient for key lookups.
something in lst
# for example
(cl * 10 + x) in lst
And you can grab the value by:
lst[something]
#or
lst[cl * 10 + x]
A little bit of refactoring and your code should PROFOUNDLY speed up.
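For illustration, the innermost loop might then look something like this (a sketch using the variable names from the question; not tested against the real files):

for x in range(1, 11):
    addr = cl * 10 + x + tcel
    flag = lst.get(addr)              # None if the cell address is not in the key file
    if flag == 2:                     # cell address value == 2: use the test_p97 value
        out.append(c_row[x - 1])
    else:                             # absent or value == 1: use the baseline value
        out.append(b_row[x - 1])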

Split values in a dictionary into separate values

I have this type of string:
sheet = """
magenta
turquoise,PF00575
tan,PF00154,PF06745,PF08423,PF13481,PF14520
turquoise, PF00011
NULL
"""
Every line starts with an identifier (e.g. tan, magenta, ...). What I want is to count the number of occurrences of each PF number per identifier.
So, the final structure would be something like this:
         magenta  turquoise  tan  NULL
PF00575        0          0    0     0
PF00154        0          1    0     0
PF06745        0          0    1     0
PF08423        0          0    1     0
PF13481        0          0    1     0
PF14520        0          0    1     0
PF00011        0          1    0     0
I started by making a dictionary where every first word on a line is a key, and then I want the PF numbers behind it as the values.
When I use this code, I get the values as a list of strings instead of as separate values in the dictionary:
from collections import defaultdict

lines = []
lines.append(sheet.split("\n"))
flattened = []
flattened = [val for sublist in lines for val in sublist]
pfams = []
for i in flattened:
    pfams.append(i.split(","))

d = defaultdict(list)
for i in pfams:
    pfam = i[0]
    d[pfam].append(i[1:])
So, the result is this:
defaultdict(<type 'list'>, {'': [[], []], 'magenta': [[]], 'NULL': [[]], 'turquoise': [['PF00575']], 'tan': [['PF00154', 'PF06745', 'PF08423', 'PF13481', 'PF14520']]})
How can I split up the PFnumbers so that they are separate values in the dictionary and then count the number of occurrences of each unique PF-number per key?
Use collections.Counter (https://docs.python.org/2/library/collections.html#collections.Counter)
import collections

sheet = """
magenta
turquoise,PF00575
tan,PF00154,PF06745,PF08423,PF13481,PF14520
NULL
"""

acc = {}
for line in sheet.split('\n'):
    if line == "NULL":
        continue
    parts = line.split(',')
    acc[parts[0]] = collections.Counter(parts[1])
EDIT: Now with accumulating all PF values for each key
acc = collections.defaultdict(list)
for line in sheet.split('\n'):
    if line == "NULL":
        continue
    parts = line.split(',')
    acc[parts[0]] += parts[1:]

acc = {k: collections.Counter(v) for k, v in acc.iteritems()}
Final edit: count the occurrence of colours per PF value, which is what we were after all along:
acc = collections.defaultdict(list)
for line in sheet.split('\n'):
    if line == "NULL":
        continue
    parts = line.split(',')
    for pfval in parts[1:]:
        acc[pfval] += [parts[0]]

acc = {k: collections.Counter(v) for k, v in acc.iteritems()}
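To get the tabular layout shown in the question, the resulting acc can then be printed with one row per PF value; a minimal sketch (the column order is an assumption):

colors = ['magenta', 'turquoise', 'tan', 'NULL']
print('\t'.join([''] + colors))
for pfval, color_counts in acc.items():
    # a Counter returns 0 for colours that never occurred with this PF value
    print('\t'.join([pfval] + [str(color_counts[c]) for c in colors]))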
With thanks to dwblas on devshed, this is the most efficient way I've found to tackle the task:
I build a dictionary whose key is the PF number, and a list ordered by how I want the colors printed.
colors_list = ['cyan', 'darkorange', 'greenyellow', 'yellow', 'magenta', 'blue', 'green', 'midnightblue', 'brown', 'darkred', 'lightcyan', 'lightgreen', 'darkgreen', 'royalblue', 'orange', 'purple', 'tan', 'grey60', 'darkturquoise', 'red', 'lightyellow', 'darkgrey', 'turquoise', 'salmon', 'black', 'pink', 'grey', 'null']

lines = sheet.splitlines()
counts = {}
for line in lines:
    parts = line.split(",")
    if len(parts) > 1:
        ## doesn't break out the same item in the list many times
        color = parts[0].strip().lower()
        for key in parts[1:]:  ## skip color
            key = key.strip()
            if key not in counts:
                ## new key and list of zeroes - print it if you want to verify
                counts[key] = [0 for ctr in range(len(colors_list))]
            ## offset number/location of this color in list
            el_number = colors_list.index(color)
            if el_number > -1:  ## color found
                counts[key][el_number] += 1
            else:
                print "some error message"

import csv
with open("out.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(["PFAM"] + colors_list)
    for pfam in counts:
        writer.writerow([pfam] + counts[pfam])
