Getting the sum of a CSV column without pandas in Python

I have a csv file passed into a function as a string:
csv_input = """
quiz_date,location,size
2022-01-01,london_uk,134
2022-01-02,edingburgh_uk,65
2022-01-01,madrid_es,124
2022-01-02,london_uk,125
2022-01-01,edinburgh_uk,89
2022-01-02,madric_es,143
2022-01-02,london_uk,352
2022-01-01,edinburgh_uk,125
2022-01-01,madrid_es,431
2022-01-02,london_uk,151"""
I want to print the sum of how many people were surveyed in each city by date, so something like:
Date. City. Pop-Surveyed
2022-01-01. London. 134
2022-01-01. Edinburgh. 214
2022-01-01. Madrid. 555
2022-01-02. London. 628
2022-01-02. Edinburgh. 65
2022-01-02. Madrid. 143
As I can't import pandas on my machine (can't install without internet access) I thought I could use a defaultdict to store the value of each city by date
from collections import defaultdict

survery_data = csv_input.split()[1:]
survery_data = [survey.split(',') for survey in survery_data]

survey_sum = defaultdict(dict)
for survey in survery_data:
    date = survey[0]
    city = survey[1].split("_")[0]
    quantity = survey[-1]
    survey_sum[date][city] += quantity

print(survey_sum)
But doing this returns a KeyError:
KeyError: 'london'
When I was hoping to have a defaultdict of
{'2022-01-01': {'london': 134}, {'edinburgh': 214}, {'madrid': 555}},
{'2022-01-02': {'london': 628}, {'edinburgh': 65}, {'madrid': 143}}
Is there a way to create a default dict that gives a structure so I could then iterate over to print out each column like above?

Try:
csv_input = """\
quiz_date,location,size
2022-01-01,london_uk,134
2022-01-02,edingburgh_uk,65
2022-01-01,madrid_es,124
2022-01-02,london_uk,125
2022-01-01,edinburgh_uk,89
2022-01-02,madric_es,143
2022-01-02,london_uk,352
2022-01-01,edinburgh_uk,125
2022-01-01,madrid_es,431
2022-01-02,london_uk,151"""

header, *rows = (
    tuple(map(str.strip, line.split(",")))
    for line in map(str.strip, csv_input.splitlines())
)

tmp = {}
for date, city, size in rows:
    key = (date, city.split("_")[0])
    tmp[key] = tmp.get(key, 0) + int(size)

out = {}
for (date, city), size in tmp.items():
    out.setdefault(date, []).append({city: size})

print(out)
Prints:
{
    "2022-01-01": [{"london": 134}, {"madrid": 555}, {"edinburgh": 214}],
    "2022-01-02": [{"edingburgh": 65}, {"london": 628}, {"madric": 143}],
}
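As a follow-up, the tmp dict already holds the per-date, per-city totals, so printing the table from the question is a short loop (a usage sketch, not part of the original answer):
# print the totals in the "Date. City. Pop-Surveyed" layout from the question
print("Date. City. Pop-Surveyed")
for (date, city), total in sorted(tmp.items()):
    print(f"{date}. {city.title()}. {total}")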

Changing
survey_sum = defaultdict(dict)
to
survey_sum = defaultdict(lambda: defaultdict(int))
(and converting the count with int(quantity), since the CSV values are read in as strings) allows the return of
defaultdict(<function survey_sum.<locals>.<lambda> at 0x100edd8b0>, {'2022-01-01': defaultdict(<class 'int'>, {'london': 134, 'madrid': 555, 'edinburgh': 214}), '2022-01-02': defaultdict(<class 'int'>, {'edingburgh': 65, 'london': 628, 'madrid': 143})})
which you can then iterate over to create a list.
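For completeness, a minimal sketch of the full corrected loop (not from the original answer, variable names tidied) might look like this; the int() conversion is the only change needed beyond the nested defaultdict:
from collections import defaultdict

survey_data = csv_input.strip().splitlines()[1:]        # drop the header row
survey_data = [line.split(",") for line in survey_data]

survey_sum = defaultdict(lambda: defaultdict(int))      # missing cities default to 0
for date, location, size in survey_data:
    city = location.split("_")[0]
    survey_sum[date][city] += int(size)                 # size arrives as a string

print(survey_sum)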


Create a dictionary where the keys are values of dictionaries inside lists in a dictionary and the values are the number of times they appear

I have this dictionary of lists of dictionaries (I cannot change the structure for the work):
dict_countries = {'gb': [{'datetime': '1955-10-10 17:00:00', 'city': 'chester'},
                         {'datetime': '1974-10-10 23:00:00', 'city': 'chester'}],
                  'us': [{'datetime': '1955-10-10 17:00:00', 'city': 'hudson'}]
                  }
And the function:
def Seen_in_the_city(dict_countries: dict) -> dict:
    city_dict = {}
    for each_country in dict_countries.values():
        for each_sight in each_country:
            citi = each_sight["city"]
            if citi in city_dict.keys():
                city_dict[each_sight["city"]] =+1
            else:
                city_dict[citi] =+1
    return city_dict
I get:
{'chester': 1,'hudson': 1}
instead of
{'chester': 2,'hudson': 1}
You can try using Counter (a subclass of dict) from the collections module in the Python Standard Library:
from collections import Counter

c = Counter()
for key in dict_countries:
    for d in dict_countries[key]:
        c.update(v for k, v in d.items() if k == 'city')

print(c)
Output
Counter({'chester': 2, 'hudson': 1})
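Since every sighting dict has a 'city' key, the same Counter can also be built in a single expression (a compact variant, not from the original answer):
from collections import Counter

# count each sighting's city across all countries in one pass
c = Counter(sight["city"] for sightings in dict_countries.values() for sight in sightings)
print(c)  # Counter({'chester': 2, 'hudson': 1})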
Try:
output = dict()
for country, cities in dict_countries.items():
    for city in cities:
        if city["city"] not in output:
            output[city["city"]] = 0
        output[city["city"]] += 1
You don't need to write +1 to mean a positive number. Also, in the if citi branch, += 1 adds 1 to the existing value, whereas =+1 just assigns the value 1 again (it is parsed as = +1):
if citi in city_dict.keys():
    city_dict[each_sight["city"]] += 1
else:
    city_dict[citi] = 1
You can use groupby from itertools. Note that this groups each country's whole sighting list by its first city, so it only gives the right counts here because every sighting within a country happens to be in the same city (see the more general sketch below):
from itertools import groupby

print({i: len(list(j)[0]) for i, j in groupby(dict_countries.values(), key=lambda x: x[0]["city"])})
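A more general version flattens the sightings and sorts them by city first, since groupby only merges adjacent items (a sketch, assuming the same dict_countries):
from itertools import groupby

# flatten all sightings to city names, sort so equal cities are adjacent, then count each group
cities = sorted(sight["city"] for sightings in dict_countries.values() for sight in sightings)
print({city: len(list(group)) for city, group in groupby(cities)})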
If you don't want additional imports (not that you shouldn't use Counter) here's another way:
dict_countries = {'gb': [{'datetime': '1955-10-10 17:00:00', 'city': 'chester'},
                         {'datetime': '1974-10-10 23:00:00', 'city': 'chester'}],
                  'us': [{'datetime': '1955-10-10 17:00:00', 'city': 'hudson'}]
                  }

def Seen_in_the_city(dict_countries: dict) -> dict:
    city_dict = {}
    for each_country in dict_countries.values():
        for each_sight in each_country:
            citi = each_sight["city"]
            city_dict[citi] = city_dict.get(citi, 0) + 1
    return city_dict

print(Seen_in_the_city(dict_countries))

Two dictionaries from a text file where values are columns and keys are line counts

My file has this pattern (this is an example). I want to create two dictionaries, one for each column, with integers as keys.
John Moor
Age 22
id 112
grade 60
Amy Ling
Age 22
id 114
grade 67
The dictionaries should look like this:
dict1 = {1 : ["John", "Age", "id", "grade"], 2 : ["Amy", "Age", "id", "grade"]}
dict2 = {1 : ["Moor", 22, 112, 60], 2 : ["Ling", 22, 114, 67]}
I did some digging, but most solutions given use the first column as the key and the second column as the value.
This is as much as I could think of; I tried to split the columns with readlines().split()[0], but it did not work.
f = open(fname)
d = {}
for line in f:
    name, info = line.split()
    d[name] = info
print(d)
Any suggestions? Any help? How should I do it? :)
You can use defaultdict to store the data, and the strip and split string methods to manipulate each line that you read.
file.txt
John Moor
Age 22
id 112
grade 60
Amy Ling
Age 22
id 114
grade 67
Here is the code:
from collections import defaultdict

dict1 = defaultdict(list)
dict2 = defaultdict(list)

with open("file.txt") as file:
    idx = 1
    for line in file.readlines():
        stripped_line = line.strip()
        if stripped_line:
            items = stripped_line.split(" ")
            dict1[idx].append(items[0])
            try:
                dict2[idx].append(int(items[-1]))
            except ValueError:
                dict2[idx].append(items[-1])
        else:
            idx += 1

print(f"dict1: {dict(dict1)}")
print(f"dict2: {dict(dict2)}")
Output:
dict1: {1: ['John', 'Age', 'id', 'grade'], 2: ['Amy', 'Age', 'id', 'grade']}
dict2: {1: ['Moor', 22, 112, 60], 2: ['Ling', 22, 114, 67]}
Update
It would be more meaningful if you stored the data like this:
from collections import defaultdict

dict1 = defaultdict(list)

with open("file.txt") as file:
    is_first_line = True
    key = None
    for line in file.readlines():
        stripped_line = line.strip()
        if stripped_line:
            items = stripped_line.split(" ")
            if is_first_line:
                key = items[0] + ' ' + items[-1]
            else:
                try:
                    dict1[key].append({items[0]: int(items[-1])})
                except ValueError:
                    dict1[key].append({items[0]: items[-1]})
            is_first_line = False
        else:
            is_first_line = True

print(f"dict1: {dict(dict1)}")
Output:
dict1: {'John Moor': [{'Age': 22}, {'id': 112}, {'grade': 60}], 'Amy Ling': [{'Age': 22}, {'id': 114}, {'grade': 67}]}
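With the data keyed by full name, looking up one person and flattening their attribute dicts is straightforward (a usage sketch, assuming the dict1 built above):
# merge the per-attribute dicts for one person into a single flat dict
record = {k: v for entry in dict1["John Moor"] for k, v in entry.items()}
print(record)  # {'Age': 22, 'id': 112, 'grade': 60}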

Sorting an API response with Python to Excel or CSV

I'm trying to sort the free UK Police API response into a readable format (CSV or Excel).
I'm using the Requests library. My initial code gets the response in JSON format:
import requests

r = requests.get('https://data.police.uk/api/crimes-street/all-crime?poly=51.169,-0.633:51.186,-0.5436:51.226,-0.6224&date=2019-12')
r_json = r.json()

for i in r_json:
    for key, value in i.items():
        print(key, ":", value)
The code above produces the following:
category : anti-social-behaviour location_type : Force location : {'latitude': '51.196818', 'street': {'id': 1147343, 'name': 'On or near Parking Area'}, 'longitude': '-0.605146'} context : outcome_status : None persistent_id : id : 79955592 location_subtype : month : 2019-12
How can I create a table with the correct headers for the response I get? The headers would be 'category', 'latitude', 'street', 'name', 'longitude', 'month'.
You need to go deeper into the dictionary tree to get some data, like latitude. The results are collected into a list of lists, then loaded into a data frame and saved as a CSV file.
import requests
import pandas as pd

r = requests.get('https://data.police.uk/api/crimes-street/all-crime?poly=51.169,-0.633:51.186,-0.5436:51.226,-0.6224&date=2019-12')
r_json = r.json()

# collect data into list of lists
collected_data = []
for data in r_json:
    category = data.get('category')
    month = data.get('month')
    latitude = ''
    longitude = ''
    street = ''
    for key, value in data.items():
        if key == 'location':
            latitude = value.get('latitude')
            longitude = value.get('longitude')
            street = value.get('street').get('name')
    collected_data.append([category, latitude, longitude, street, month])

# load data into data frame
df = pd.DataFrame(collected_data, columns=['Category', 'Latitude', 'Longitude', 'Street', 'Month'])

# save data frame into csv
df.to_csv('data.csv')
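If your pandas is recent enough to have pandas.json_normalize (1.0+), it can flatten the nested location dicts for you, so the manual loop becomes optional (a sketch, assuming the same r_json response):
import pandas as pd

# nested keys become dotted column names such as 'location.latitude' and 'location.street.name'
df = pd.json_normalize(r_json)
df = df[['category', 'location.latitude', 'location.longitude', 'location.street.name', 'month']]
df.columns = ['Category', 'Latitude', 'Longitude', 'Street', 'Month']
df.to_csv('data.csv', index=False)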

Nested dictionary keeps overwriting data

I am trying to read in from a data file that has lines like:
2007 ANDREA 30 31.40 -71.90 05/13/18Z 25 1007 LOW
2007 ANDREA 31 31.80 -69.40 05/14/00Z 25 1007 LOW
I am trying to create a nested dictionary that has a key holding the year and then the nested dictionary will hold the name and a tuple containing statistics. I would like the return value to look like this:
{'2007': {'ANDREA': [(31.4, -71.9, '05/13/18Z', 25.0, 1007.0), (31.8, -69.4, '05/14/00Z', 25.0, 1007.0)]}}
However, when I run the code it returns only one set of statistics. It seems to be overwriting itself, because I get only the statistics from the last line of the txt file:
{'2007': {'ANDREA': [(31.8, -69.4, '05/14/00Z', 25.0, 1007.0)]}}
Here is the code:
def create_dictionary(fp):
    '''Remember to put a docstring here'''
    dict1 = {}
    f = []
    for line in fp:
        a = line.split()
        f.append(a)
    for item in f:
        a = (float(item[3]), float(item[4]), item[5], float(item[6]),
             float(item[7]))
        dict1 = update_dictionary(dict1, item[0], item[1], a)
    print(dict1)
def update_dictionary(dictionary, year, hurricane_name, data):
    if year not in dictionary:
        dictionary[year] = {}
        if hurricane_name not in dictionary:
            dictionary[year][hurricane_name] = [data]
        else:
            dictionary[year][hurricane_name].append(data)
    else:
        if hurricane_name not in dictionary:
            dictionary[year][hurricane_name] = [data]
        else:
            dictionary[year][hurricane_name].append(data)
    return dictionary
These lines:
if hurricane_name not in dictionary:
...should be:
if hurricane_name not in dictionary[year]:
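Equivalently, the whole if/else ladder can be collapsed with setdefault (a sketch of the same idea, not the poster's original code):
def update_dictionary(dictionary, year, hurricane_name, data):
    # create the inner dict and the list on first sight of a year/name, then append
    dictionary.setdefault(year, {}).setdefault(hurricane_name, []).append(data)
    return dictionary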
Since I was a little late, here's a suggestion instead of an answer to your original question. You can simplify the logic a bit, because when the year doesn't exist, the name also can't exist for that year. Everything can be put in a single function, and using a "with" statement to open the file ensures it is properly closed even if your program encounters an error.
def build_dict(file_path):
    result = {}
    with open(file_path, 'r') as f:
        for line in f:
            items = line.split()
            year, name, data = items[0], items[1], tuple(items[2:])
            if year in result:
                if name in result[year]:
                    result[year][name].append(data)
                else:
                    result[year][name] = [data]
            else:
                result[year] = {name: [data]}
    return result

print(build_dict(file_path))
Output:
{'2007': {'ANDREA': [('30', '31.40', '-71.90', '05/13/18Z', '25', '1007', 'LOW'), ('31', '31.80', '-69.40', '05/14/00Z', '25', '1007', 'LOW')]}}
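A collections.defaultdict keyed the same way removes the remaining if/else as well (an alternative sketch under the same assumptions):
from collections import defaultdict

def build_dict(file_path):
    # year -> hurricane name -> list of data tuples; wrap in dict(...) for a plain repr
    result = defaultdict(lambda: defaultdict(list))
    with open(file_path, 'r') as f:
        for line in f:
            items = line.split()
            result[items[0]][items[1]].append(tuple(items[2:]))
    return result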

Creating lists from the dictionary, or just simply sorting it

I have the following code:
import os
import pprint

file_path = input("Please, enter the path to the file: ")

if os.path.exists(file_path):
    worker_dict = {}
    k = 1
    for line in open(file_path, 'r'):
        split_line = line.split()
        worker = 'worker{}'.format(k)
        worker_name = '{}_{}'.format(worker, 'name')
        worker_yob = '{}_{}'.format(worker, 'yob')
        worker_job = '{}_{}'.format(worker, 'job')
        worker_salary = '{}_{}'.format(worker, 'salary')
        worker_dict[worker_name] = ' '.join(split_line[0:2])
        worker_dict[worker_yob] = ' '.join(split_line[2:3])
        worker_dict[worker_job] = ' '.join(split_line[3:4])
        worker_dict[worker_salary] = ' '.join(split_line[4:5])
        k += 1
else:
    print('Error: Invalid file path')
File:
John Snow 1967 CEO 3400$
Adam Brown 1954 engineer 1200$
Output from worker_dict:
{
    'worker1_job': 'CEO',
    'worker1_name': 'John Snow',
    'worker1_salary': '3400$',
    'worker1_yob': '1967',
    'worker2_job': 'engineer',
    'worker2_name': 'Adam Brown',
    'worker2_salary': '1200$',
    'worker2_yob': '1954',
}
I want to sort the data by worker name and then by salary. My idea was to create a separate list of salaries and worker names to sort, but I have problems filling it. Maybe there is a more elegant way to solve my problem?
import os
import pprint

file_path = input("Please, enter the path to the file: ")

if os.path.exists(file_path):
    worker_dict = {}
    k = 1
    with open(file_path, 'r') as file:
        content = file.read().splitlines()
    res = []
    for i in content:
        val = i.split()
        name = [" ".join([val[0], val[1]])]  # concatenate first name and last name
        i = name + val[2:]                   # prepend name
        res.append(i)                        # append modified value to new list

    res.sort(key=lambda x: x[3])  # sort by salary
    print(res)

    res.sort(key=lambda x: x[0])  # sort by name
    print(res)
Output:
[['Adam Brown', '1954', 'engineer', '1200$'], ['John Snow', '1967', 'CEO', '3400$']]
[['Adam Brown', '1954', 'engineer', '1200$'], ['John Snow', '1967', 'CEO', '3400$']]
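Note that sorting by x[3] compares the salary strings, which only happens to work here because '1200$' and '3400$' have the same number of digits; for a robust numeric sort, strip the '$' and convert (a small sketch building on the res list above):
# sort salaries numerically instead of lexicographically
res.sort(key=lambda worker: int(worker[3].rstrip('$')))
print(res)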
d = {
    'worker1_job': 'CEO',
    'worker1_name': 'John Snow',
    'worker1_salary': '3400$',
    'worker1_yob': '1967',
    'worker2_job': 'engineer',
    'worker2_name': 'Adam Brown',
    'worker2_salary': '1200$',
    'worker2_yob': '1954',
}

from itertools import zip_longest

# re-group:
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# re-order:
res = []
for group in list(grouper(d.values(), 4)):
    reorder = [1, 2, 0, 3]
    res.append([group[i] for i in reorder])

# sort:
res.sort(key=lambda x: (x[1], x[2]))
Output:
[['Adam Brown', '1200$', 'engineer', '1954'],
['John Snow', '3400$', 'CEO', '1967']]
grouper is defined and explained in the itertools recipes. I've grouped your dictionary into the records pertaining to each worker and returned them as a reordered list of lists. As lists, I sort them by name and salary. This solution is modular: it distinctly groups, re-orders and sorts.
I recommend storing the workers in a different format, for example .csv; then you could use csv.DictReader and put the rows into a list of dictionaries (this would also allow jobs, names, etc. with more than one word, like "tomb raider").
Note that you have to convert the year of birth and salary to ints or floats to sort them correctly; otherwise they get sorted lexicographically, as in a real-world dictionary (the book), because they are strings, e.g.:
>>> sorted(['100', '11', '1001'])
['100', '1001', '11']
To sort the list of dicts you can use operator.itemgetter as the key argument of sorted, instead of a lambda function, and just pass the desired key to itemgetter.
The k variable is unnecessary, because it's just the length of the list.
The .csv file:
"name","year of birth","job","salary"
John Snow,1967,CEO,3400$
Adam Brown,1954,engineer,1200$
Lara Croft,1984,tomb raider,5600$
The .py file:
import os
import csv
from operator import itemgetter
from pprint import pprint

file_path = input('Please, enter the path to the file: ')

if os.path.exists(file_path):
    with open(file_path, 'r', newline='') as f:
        worker_list = list(csv.DictReader(f))

    for worker in worker_list:
        worker['salary'] = int(worker['salary'].strip('$'))
        worker['year of birth'] = int(worker['year of birth'])

    pprint(worker_list)
    pprint(sorted(worker_list, key=itemgetter('name')))
    pprint(sorted(worker_list, key=itemgetter('salary')))
    pprint(sorted(worker_list, key=itemgetter('year of birth')))
You still need some error handling in case an int conversion fails, or you can just let the program crash.
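For that error handling, one minimal option is to wrap the conversion loop above and keep the raw strings when a row can't be converted (a sketch under the same assumptions):
for worker in worker_list:
    try:
        worker['salary'] = int(worker['salary'].strip('$'))
        worker['year of birth'] = int(worker['year of birth'])
    except ValueError:
        # leave the original string values in place for malformed rows
        pass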
