I need a little help reading specific values into a dictionary using Python. I have a CSV file of users: each user (1, 2, 3...) belongs to a specific department (1, 2, 3...), and each department is in a specific building (1, 2, 3...). I need to list all the users in department 1 of building 1, then department 2 of building 1, and so on. I have read everything into one massive dictionary using csv.DictReader, but that would only work if I could then search through which entries go into each dictionary of dictionaries. Any ideas for how to sort through this file? The CSV has over 150,000 user entries. Each row is a new user and lists 3 attributes: user_name, department number, and department building. There are 100 departments, 100 buildings, and 150,000 users. Any ideas for a short script to sort them all out? Thanks for your help in advance.
A brute-force approach would look like
import csv

with open('myfile.csv', newline='') as f:
    data = list(csv.reader(f))
data.sort(key=lambda x: (x[2], x[1], x[0]))
It could then be extended to
import csv
from collections import defaultdict

# building -> dept -> list of user names
data = defaultdict(lambda: defaultdict(list))
with open('myfile.csv', newline='') as f:
    for name, dept, building in csv.reader(f):
        data[building][dept].append(name)

for building in sorted(data):
    print("Building {0}".format(building))
    for dept in sorted(data[building]):
        print("  Dept {0}".format(dept))
        for name in sorted(data[building][dept]):
            print("   ", name)
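To see the grouping in action without a file on disk, the same loop can be fed sample rows through io.StringIO (the user names here are made up for illustration):

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for myfile.csv:
# user_name, department, building
sample = io.StringIO(
    "alice,2,1\n"
    "bob,1,1\n"
    "carol,1,2\n"
)

data = defaultdict(lambda: defaultdict(list))
for name, dept, building in csv.reader(sample):
    data[building][dept].append(name)

# Building '1' now holds dept '1' -> ['bob'] and dept '2' -> ['alice']
for building in sorted(data):
    for dept in sorted(data[building]):
        print(building, dept, sorted(data[building][dept]))
```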
I've been given a homework task to get data from a csv file without using Pandas. The info in the csv file contains headers such as...
work year:
experience level: EN Entry-level / Junior, MI Mid-level / Intermediate, SE Senior-level / Expert, EX Executive-level / Director
employment type: PT Part-time, FT Full-time, CT Contract, FL Freelance
job title:
salary:
salary currency:
salaryinusd: The salary in USD
employee residence: Employee’s primary country of residence
remote ratio:
One of the questions is:
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
The only way I've managed to do this is to iterate through the csv and add a load of 'if' statements according to the experience level and job title, but this is taking me forever.
Any ideas of how to tackle this differently? Not using any libraries/modules.
Example of my code:
with open('/Users/xxx/Desktop/ds_salaries.csv', 'r') as f:
    csv_reader = f.readlines()
    for row in csv_reader[1:]:
        new_row = row.split(',')
        experience_level = new_row[2]
        job_title = new_row[4]
        salary_in_usd = new_row[7]
        if experience_level == 'EN' and job_title == 'AI Scientist':
            en_ai_scientist += int(salary_in_usd)
            count_en_ai_scientist += 1
avg_en_ai_scientist = en_ai_scientist / count_en_ai_scientist
print(avg_en_ai_scientist)
When working out an example like this, I find it helpful to ask, "What data structure would make this question easy to answer?"
For example, the question asks
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
To me, this implies that I want a dictionary keyed by a tuple of experience level and job title, with the salaries of every person who matches. Something like this:
data = {
    ("EN", "AI Scientist"): [1000, 2000, 3000],
    ("SE", "AI Scientist"): [2000, 3000, 4000],
}
The next question is: how do I get my data into that format? I would read the data in with csv.DictReader, and add each salary number into the structure.
import csv

data = {}
with open('input.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        experience_level = row['experience_level']
        job_title = row['job_title']
        key = experience_level, job_title
        if key not in data:
            # provide a default value if the key doesn't exist yet
            # (look at collections.defaultdict for a better way to do this)
            data[key] = []
        data[key].append(int(row['salary_in_usd']))
Now that you have your data organized, you can compute average salaries:
for (experience_level, job_title), salary_data in data.items():
    print(experience_level, job_title, sum(salary_data) / len(salary_data))
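As the comment above hints, collections.defaultdict removes the key-existence check entirely. A minimal sketch, using made-up rows in place of the csv.DictReader output (the column names are assumptions matching the dataset described in the question):

```python
from collections import defaultdict

# Hypothetical rows standing in for csv.DictReader output
rows = [
    {'experience_level': 'EN', 'job_title': 'AI Scientist', 'salary_in_usd': '1000'},
    {'experience_level': 'EN', 'job_title': 'AI Scientist', 'salary_in_usd': '3000'},
    {'experience_level': 'SE', 'job_title': 'AI Scientist', 'salary_in_usd': '4000'},
]

data = defaultdict(list)  # missing keys start as empty lists
for row in rows:
    data[row['experience_level'], row['job_title']].append(int(row['salary_in_usd']))

averages = {key: sum(v) / len(v) for key, v in data.items()}
# averages[('EN', 'AI Scientist')] == 2000.0
```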
I am very new to using Python to process data in CSV files. I have a CSV file with the data below, and I want to average the time stamps in the Sprint, Jog, and Walk columns for each session. The example below has the subject John Doe with Session2 and Session3, which I would like to average separately and write to a new CSV file. Is there a way, not using pandas but other modules like csv or NumPy, to group the data by person (subject) and then by session? I have tried to make a dictionary, but the keys get overwritten. I have also tried using a list, but I cannot figure out how to target the sessions to average them. I also tried DictReader to read the fieldnames and then process the data, but I cannot figure out how to group all the John Doe Session2 data to find the average of the times.
Subject, Session, Course, Size, Category, Sprint, Jog, Walk
John Doe, Session2, 17, 2, Bad, 25s, 36s, 55s
John Doe, Session2, 3, 2, Good, 26s, 35s, 45s
John Doe, Session2, 1, 2, Good, 22s, 31s, 47s
John Doe, Session3, 5, 2, Good, 16s, 32s, 55s
John Doe, Session3, 2, 2, Good, 13s, 24s, 52s
John Doe, Session3, 16, 2, Bad, 15s, 26s, 49s
PS I say no PANDAS because my groupmates are not adding this module since we have so many other dependencies.
Given your input, these built-in Python libraries can generate the output you want:
import csv
from itertools import groupby
from operator import itemgetter
from collections import defaultdict

with open('input.csv', 'r', newline='') as fin, open('output.csv', 'w', newline='') as fout:
    # skipinitialspace needed because the sample data had spaces after the comma delimiters.
    reader = csv.DictReader(fin, skipinitialspace=True)
    # The output file will have these fieldnames.
    writer = csv.DictWriter(fout, fieldnames='Subject Session Sprint Jog Walk'.split())
    writer.writeheader()
    # For each subject/session, groupby returns a 2-tuple of the sort key and an
    # iterator over the rows with that key. The data must already be sorted by the key!
    for (subject, session), group in groupby(reader, key=itemgetter('Subject', 'Session')):
        # Build the row to output. defaultdict(int) assumes integer 0 if a key doesn't exist.
        row = defaultdict(int)
        row['Subject'] = subject
        row['Session'] = session
        # Count the items for the average.
        count = 0
        for item in group:
            count += 1
            # Sum the columns, removing the trailing 's'.
            for col in ('Sprint', 'Jog', 'Walk'):
                row[col] += int(item[col][:-1])
        # Produce the averages.
        for col in ('Sprint', 'Jog', 'Walk'):
            row[col] /= count
        writer.writerow(row)
Output:
Subject,Session,Sprint,Jog,Walk
John Doe,Session2,24.333333333333332,34.0,49.0
John Doe,Session3,14.666666666666666,27.333333333333332,52.0
Function links: itemgetter, groupby, defaultdict
If your data is not pre-sorted, you can use the following replacement lines to read in and sort the data by using the same key used in groupby. However, in this implementation the data must be small enough to load it all into memory at once.
sortkey = itemgetter('Subject','Session')
data = sorted(reader,key=sortkey)
for (subject,session),group in groupby(data,key=sortkey):
...
As you want the average grouped by subject and session, just compose unique keys out of that information:
import csv

times = {}
with open('yourfile.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        key = row[0] + row[1]
        if key not in times:
            times[key] = row[-3:]
        else:
            times[key].extend(row[-3:])
average = {k: sum(int(entry[:-1]) for entry in v) / len(v) for k, v in times.items()}
This assumes that the first two entries do have regular structure as in your example and there is no ambiguity when composing the two first entries per row. To be sure one could insert a special delimiter between them in the key.
If you are also the person storing the data: Writing the unit of a column in the column header saves transformation effort later and avoids redundant information storage.
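Instead of a special delimiter, one way to remove the ambiguity entirely is to use a tuple of the two fields as the dictionary key. A sketch over two of the sample rows:

```python
# Tuple keys like ('John Doe', 'Session2') cannot collide the way
# concatenated strings like 'John DoeSession2' might.
times = {}
rows = [
    ['John Doe', 'Session2', '17', '2', 'Bad', '25s', '36s', '55s'],
    ['John Doe', 'Session2', '3', '2', 'Good', '26s', '35s', '45s'],
]
for row in rows:
    key = (row[0], row[1])  # subject, session
    times.setdefault(key, []).extend(row[-3:])

# Average of all six times for this subject/session, stripping the 's' suffix
average = {k: sum(int(t[:-1]) for t in v) / len(v) for k, v in times.items()}
```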
I am trying to solve one of the Rosalind challenges and I can't seem to find a way to retrieve data, within a specific time frame.
http://rosalind.info/problems/gbk/
How do I modify Entrez.esearch() to specify a time frame?
Question:
Given: A genus name, followed by two dates in YYYY/M/D format.
Return: The number of Nucleotide GenBank entries for the given genus that were published between the dates specified.
Test Data:
Anthoxanthum
2003/7/25
2005/12/27
Answer: 7
Thanks a lot to @Kayvee for the pointer! It works like a charm!
Here is a format for searching the organism by 'posted between start-end':
(Anthoxanthum[Organism]) AND ("2003/7/25"[Publication Date] : "2005/12/27"[Publication Date])
Here is Python code:
from Bio import Entrez

# GenBank gene database
geneName = "Anthoxanthum"
pubDateStart = "2003/7/25"
pubDateEnd = "2005/12/27"
searchTerm = f'({geneName}[Organism]) AND ("{pubDateStart}"[Publication Date] : "{pubDateEnd}"[Publication Date])'

print("\n[GenBank gene database]:")
Entrez.email = "please@pm.me"
handle = Entrez.esearch(db="nucleotide", term=searchTerm)
record = Entrez.read(handle)
print(record["Count"])
This is almost the same as my question from yesterday. But there I took it for granted that I would use a list of unique values to create the nested dict & list structure. Now I am asking how to build this dict & list structure (referred to below as the data structure) row by row from the Excel data.
The excel files (multiple files in a folder) all look like the following:
Category Subcategory Name
Main Dish Noodle Tomato Noodle
Main Dish Stir Fry Chicken Rice
Main Dish Soup Beef Goulash
Drink Wine Bordeaux
Drink Softdrink Cola
My desired structure of dict & list structure is:
data = [
    {'data': 0, 'Category': [
        {'name': 'Main Dish', 'Subcategory': [
            {'name': 'Noodle', 'key': 0, 'data': [{'key': 1, 'title': 'Tomato Noodle'}]},
            {'name': 'Stir Fry', 'key': 1, 'data': [{'key': 2, 'title': 'Chicken Rice'}]},
            {'name': 'Soup', 'key': 2, 'data': [{'key': 3, 'title': 'Beef Goulash'}]}]},
        {'name': 'Drink', 'Subcategory': [
            {'name': 'Wine', 'key': 0, 'data': [{'key': 1, 'title': 'Bordeaux'}]},
            {'name': 'Softdrink', 'key': 1, 'data': [{'key': 2, 'title': 'Cola'}]}]}]},
    {'data': 1, 'Category': ...}  # same structure as dataset 0
]
So looping over the Excel files themselves is fine: just set {'data': 0, 'Category': []}, {'data': 1, 'Category': []} and so on. The key point is that Main Dish has three rows in Excel but needs only one entry in the data structure, and Drink has two rows but only one entry. The subcategories nested in each category list follow the same rule: only unique values should be nested under a category. Each corresponding dish Name then goes into the data structure according to its category and subcategory.
The issue is that I cannot find a good way to convert the data into this structure. Plus, there are other columns after the Name column, so it is fairly involved. I thought about first extracting the unique values from the entire category and subcategory columns, which simplifies that part but leads to problems when filling in the corresponding Name values. If I go row by row instead, designing the "does this category or subcategory already exist?" test to keep values unique is difficult given my current programming skills...
Therefore, what would be the best approach to convert this excel file into such a data structure? Thank you very much.
One way could be to read the excelfile into a dataframe using pandas, and then build on this excellent answer Pandas convert DataFrame to Nested Json
import pandas as pd

excel_file = 'path-to-your-excel.xls'

def fdrec(df):
    drec = dict()
    ncols = df.values.shape[1]
    for line in df.values:
        d = drec
        for j, col in enumerate(line[:-1]):
            if col not in d.keys():
                if j != ncols - 2:
                    d[col] = {}
                    d = d[col]
                else:
                    d[col] = line[-1]
            else:
                if j != ncols - 2:
                    d = d[col]
    return drec

df = pd.read_excel(excel_file)
print(fdrec(df))
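If pandas is unavailable, the same nesting idea works on plain rows with dict.setdefault; here is a sketch over the sample table (rows hard-coded for illustration), as a variation that collects every name under its category/subcategory in a list:

```python
# Build {category: {subcategory: [names, ...]}} without pandas.
rows = [
    ('Main Dish', 'Noodle', 'Tomato Noodle'),
    ('Main Dish', 'Stir Fry', 'Chicken Rice'),
    ('Main Dish', 'Soup', 'Beef Goulash'),
    ('Drink', 'Wine', 'Bordeaux'),
    ('Drink', 'Softdrink', 'Cola'),
]

nested = {}
for category, subcategory, name in rows:
    # setdefault creates the intermediate dict/list only on first sight,
    # so repeated categories collapse to a single entry automatically.
    nested.setdefault(category, {}).setdefault(subcategory, []).append(name)
```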
I know Python is almost made for this kind of purpose, but I am really struggling to understand how to access specific values in the dataset, and I have tried both the pandas and csv modules. It is probably a matter of syntax. Here's the thing: I have a CSV file in the form of
Nation, Year, No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
...
...
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
...
and so on. What I would like to do is to get the total amount of refugees per year, i.e. selecting all the rows with a certain year and summing all the elements in the related "no. of refugees" column. I managed to do this:
import csv

with open('refugees.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames
    print headers

    #2013
    list2013 = []
    for line in d_reader:
        if (line['Year'] == '2013'):
            list2013.append(line['Refugees'])

list2013 = map(int, list2013)  # I have str values in my file
ref13 = sum(list2013)
but I am looking for a more elegant (and, above all, iterative) solution. Moreover, if I perform that procedure multiple times for different years, I always get 0: it works for 2013 only, not sure why.
Edit: I tried this as well, without success, but I think this could be totally wrong:
import csv

refugees_dict = {}
a = range(2005, 2014)
a = map(str, a)

with open('refugees.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    for element in a:
        for line in d_reader:
            if (line['Year'] == element):
                print 'hello!'
                temp_list = []
                temp_list.append(line['Refugees'])
                temp_list = map(int, temp_list)
                refugees_dict[a] = sum(temp_list)

print refugees_dict
The next step of my work will involve further studies on the dataset, eg I am probably gonna need to access data nation-wise instead of year-wise, and I really appreciate any hint so I understand how to manipulate data.
Thanks a lot.
Since you tagged pandas in the question, here's a pandas solution to getting the number of refugees per year.
Let's say my input csv looks like this (note that I've eliminated the extra space before the column names):
Nation,Year,No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
You can read that into a pandas DataFrame like this:
import pandas as pd

df = pd.read_csv('data.csv')
You can then get the total like this:
df.groupby(['Year']).sum()
This gives:
No. of refugees
Year
2012 7370
2013 7150
Consider:
from collections import defaultdict
by_year = defaultdict(int)  # a dict where missing keys default to 0
and then
by_year[line['year']] += int(line['Refugees'])
Now you can just look at by_year['2013'] and see your sum (same for other years).
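Put together as a complete runnable sketch (the inline sample rows are taken from the question's data, and the Refugees column name follows the earlier snippet):

```python
import csv
import io
from collections import defaultdict

# Inline sample standing in for refugees.csv
csv_text = """Nation,Year,Refugees
Afghanistan,2013,6657
Albania,2013,199
Afghanistan,2012,6960
"""

by_year = defaultdict(int)  # missing years start at 0
for line in csv.DictReader(io.StringIO(csv_text)):
    by_year[line['Year']] += int(line['Refugees'])

# by_year['2013'] == 6856, by_year['2012'] == 6960
```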
To sum by year you can try this:
f = open('file.csv').readlines()[1:]  # skip the header row
f = [i.strip('\n').split(',') for i in f]
years = {i[1]: 0 for i in f}
for i in f:
    years[i[1]] += int(i[-1])
Now, you have a dictionary that has the sum of all the refugees by year.
To access nation-wise:
nations = {i[0]: 0 for i in f}
for i in f:
    nations[i[0]] += int(i[-1])
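Both groupings can also be done by column name rather than position using csv.DictReader, which survives column reordering. A sketch assuming a Refugees header as in the earlier snippets, with the sample fed through io.StringIO:

```python
import csv
import io
from collections import defaultdict

# Inline sample standing in for the refugees file
csv_text = """Nation,Year,Refugees
Afghanistan,2013,6657
Albania,2013,199
Afghanistan,2012,6960
"""

years = defaultdict(int)
nations = defaultdict(int)
for row in csv.DictReader(io.StringIO(csv_text)):
    n = int(row['Refugees'])
    years[row['Year']] += n     # total per year
    nations[row['Nation']] += n # total per nation
```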