Previously I got average gas prices per year like this:
1993 1.0711538461538466
1994 1.0778653846153845
1995 1.1577115384615386
1996 1.2445283018867925
1997 1.2442499999999999
I want to plot them on a graph, so I tried this code:
import matplotlib.pyplot as plt
import numpy as np
with open('c:/Gasprices.txt', 'r') as file:
    td = dict()
    for line in file:
        year = line[6:10]
        price = float(line[11:])
        td.setdefault(year, []).append(price)
    for k, v in td.items():
        print(f'{k} {sum(v) / len(v):}')
        x=[k], y=[sum(v) / len(v)]
        plt.plot(x,y, 'o--')
plt.title('Average gas price per year in US')
plt.xlabel('year')
plt.ylabel('Avg.gas price per gallon[$]')
plt.grid()
plt.yticks(np.arange(1.0, 3.5, 0.5))
plt.tight_layout()
plt.show()
But I don't know what I should put in x and y.
I copied your code and sample data and here is my assessment and answer:
First of all, to answer your question in plain English: x is usually used for time and y for the values that change over time. Judging by your code I think you know that already, so the year goes on x and the average gas price on y.
Next, I made some modifications to your code to make it work (which I will explain after):
import matplotlib.pyplot as plt
import numpy as np
x = list()
y = list()
with open('e:/Gasprices.txt', 'r') as file:
    for line in file:
        line_striped = line.strip()
        if line_striped == '':
            continue
        line_splitted = line_striped.split(' ')
        x.append(
            int(line_splitted[0])
        )
        y.append(
            float(line_splitted[1])
        )
plt.plot(x,y, 'o--')
plt.title('Average gas price per year in US')
plt.xlabel('year')
plt.ylabel('Avg.gas price per gallon[$]')
plt.grid()
plt.yticks(np.arange(1.0, 3.5, 0.5))
plt.tight_layout()
plt.show()
So what's going on:
I created two empty lists before opening the file and iterating through its lines.
Inside the loop, I skip empty lines: .strip() removes whitespace from the beginning and end of a string, and if the result is equal to an empty string, the line is skipped entirely.
Then I split the stripped line on the space between the two fields, which gives a list of two items: the first element is the year and the second is the average price.
Finally, I added each value to its respective list: the year to x and the average price to y.
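To see what the strip-and-split step produces, here is a small sketch that parses a single line taken from the sample data in the question. The helper name parse_line is only for illustration, and split() without arguments is used so that repeated spaces or tabs are tolerated too, which is an assumption about the file.

```python
def parse_line(line):
    """Return (year, price) for a 'YEAR PRICE' line, or None for a blank line."""
    stripped = line.strip()
    if stripped == '':
        return None
    # split() with no argument splits on any run of whitespace
    year_text, price_text = stripped.split()
    return int(year_text), float(price_text)


print(parse_line('1993 1.0711538461538466\n'))
print(parse_line('   \n'))
```

Each non-blank line yields one value for x and one for y, which is why the two lists always end up with the same length.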
I hope it helped.
P.S. My english is not very good so I hope my answer is somewhat readable.
I need help with this exercise. I have a csv file like this one in one column but with 16000 entries:
Entity,Code,Year,Life expectancy (years)
Afghanistan,AFG,1950,27.638
Afghanistan,AFG,1951,27.878
Serbia,SRB,1995,71.841
Zimbabwe,ZWE,2019,61.49
I need to print this, I got the first 2 parts working.
✅What is the year and country that has the lowest life expectancy in the dataset?
✅What is the year and country that has the highest life expectancy in the dataset?
❌Allow the user to type in a year, then, find the average life expectancy for that year. Then find the country with the minimum and the one with the maximum life expectancy for that year.
So far I'm here, and I need some help getting the last part to print something like this for the year entered by the user:
For the year 1959:
The average life expectancy across all countries was 54.95
The max life expectancy was in Norway with 73.49
The min life expectancy was in Mali with 28.077
import csv

print ("Enter a year to find the average life expectancy for that year: ")
input_year = input ("\n""YEAR: ")

#Allow the user to type in a year, then, find the average life expectancy for that year
def subset_year(all_data, selected_year):
    year_only = []
    for entity, code, year, expectancy in all_data:
        if year == selected_year:
            year_only.append((entity, code, year, expectancy))
    return year_only

def pr_row(headers, row):
    return ", ".join(f"{label}:{value}" for label, value in zip(headers, row))

data = []
with open(r"C:\Users\X230\Desktop\Python class\life-expectancy.csv") as database:
    reader = csv.reader(database)
    # the first row in the CSV file is; Entity,Code,Year,Life expectancy (years)
    # example of the data in the CVS file: Afghanistan,AFG,1950,27.638
    parts = next(reader)
    for line in reader:
        # print(line) #this prints everything, not very useful so I removed it
        # Save the parts I need into variables
        entity, code, year, expectancy = line
        data.append([entity, code, int(year), float(expectancy)])

def key_year(row):
    return row[3]

print()
print("The overall max life expectancy is: ", pr_row(parts, max(data, key=key_year)))
print("The overall min life expectancy is: ", pr_row(parts, min(data, key=key_year)))
print("The average life expectancy across all countries was: ", ) #??????????????
print("The max life expectancy was in: ", ) #????????????????????????????????????
print("The min life expectancy was in: ", ) #????????????????????????????????????
year = input_year
all_by_year = subset_year(data, year)
print(all_by_year)
The first thing I would do is iterate over the lines of the file and separate the rows that are useful. Then, I would use the map() function to get the life expectancy of each row (without having to use a for loop to iterate over each one of them), and by putting that into min() and max() functions you can easily get the minimum and maximum value.
For the average, I just used sum() to get the sum of all selected values and divided that by the length of a list containing all those values (basically, the number of values). You can use mean() to get the average too, but you would need to import the statistics module first.
Finally, it iterates over the selected rows until it finds the row which contains the minimum/maximum values, just to print it along with the country of that row.
import csv

year_input = input('Enter the year: ')

with open('data.csv','r') as file:
    reader = csv.reader(file)
    lines = []
    # We iterate over the lines in the file, excluding the header of course.
    # If the year matches the user input, then we append that row to a list.
    for line in reader:
        if line == ['Entity', 'Code', 'Year', 'Life expectancy (years)']:
            continue
        if line[2] == year_input:
            lines.append(line)

# Once we have that list, we get the average, the minimum and the maximum value for the life expectancy.
# The values are converted to float first: min()/max() on the raw strings would compare them alphabetically.
values = list(map(lambda x: float(x[3]), lines))
average = sum(values) / len(values)
minimum = min(values)
maximum = max(values)

print('For the year ' + year_input + ':')
print('The average life expectancy across all countries was ' + str(average))

# Now, we iterate over the rows we selected before until we find the minimum and maximum value.
# When we find the row with the minimum/maximum value, we print it next to the country on that same row.
for line in lines:
    if float(line[3]) == minimum:
        print('The min life expectancy was in ' + line[0] + ' with ' + line[3])
    if float(line[3]) == maximum:
        print('The max life expectancy was in ' + line[0] + ' with ' + line[3])
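As an alternative sketch, min() and max() accept a key function, so they can return the whole row directly and the final search loop is not needed; statistics.mean gives the average. The rows below are only illustrative (the Norway and Mali figures echo the expected output in the question, the Afghanistan row comes from its sample), and the function name summarize_year is hypothetical.

```python
import statistics


def summarize_year(rows, year):
    """Return (average, min_row, max_row) for the rows of one year.

    Each row is [entity, code, year, expectancy] with string fields,
    as csv.reader would yield them.
    """
    selected = [row for row in rows if row[2] == year]
    if not selected:
        return None  # no data for that year

    def expectancy(row):
        # compare as numbers, not as strings
        return float(row[3])

    average = statistics.mean(expectancy(row) for row in selected)
    return average, min(selected, key=expectancy), max(selected, key=expectancy)


# Illustrative rows only, not the real dataset
rows = [
    ['Afghanistan', 'AFG', '1950', '27.638'],
    ['Norway', 'NOR', '1959', '73.49'],
    ['Mali', 'MLI', '1959', '28.077'],
]
result = summarize_year(rows, '1959')
if result is not None:
    average, lowest, highest = result
    print(f'The average life expectancy across all countries was {average:.2f}')
    print(f'The max life expectancy was in {highest[0]} with {highest[3]}')
    print(f'The min life expectancy was in {lowest[0]} with {lowest[3]}')
```

Returning None for a year with no rows also avoids dividing by zero when the user types a year that is not in the file.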
I am trying to create multiple horizontal bar plots for a dataset. The data deals with race times from a running race.
The dataframe has the following columns: Name, Age Group, Finish Time, Finish Place, Hometown, Times Ran The Race. Sample data below.
Name   | Age Group | Finish Time | Finish Place | Hometown      | Times Ran The Race
John   | 30-39     | 15.5        | 1            | New York City | 2
Mike   | 30-39     | 17.2        | 2            | Denver        | 1
Travis | 40-49     | 20.4        | 1            | Louisville    | 3
James  | 40-49     | 22.1        | 2            | New York City | 1
I would like to create a bar plot similar to what is shown below. There would be 1 bar chart per age group, fastest runner on bottom of chart, runner name with city and number of times ran the race below their name.
Do I need a for loop or would a simple groupby work? The number and sizing of each age group can be dynamic based off the race so it is not a constant, but would be dependent on the dataframe that is used for each race.
I used a loop. For each age group I extract a temporary data frame, then build one multi-line label per runner from the name, hometown, and number of times they ran the race, and store those labels in a new list. Next, I draw a horizontal bar graph and replace the tick labels on the y-axis with them.
for ag in df['Age Group'].unique():
    label_all = []
    tmp = df[df['Age Group'] == ag]
    labels = [[x,y,z] for x,y,z in zip(tmp.Name.values, tmp.Hometown.values, tmp['Times Ran The Race'].values)]
    for k in range(len(labels)):
        label_all.append(labels[k])
    l_all = []
    for l in label_all:
        lbl = l[0] + '\n'+ l[1] + '\n' + str(l[2]) + ' Time'
        l_all.append(lbl)
    ax = tmp[['Name', 'Finish Time']].plot(kind='barh', legend=False)
    ax.set_title(ag +' Age Group')
    ax.set_yticklabels([l_all[x] for x in range(len(l_all))])
    ax.grid(axis='x')
    for i in ['top','bottom','left','right']:
        ax.spines[i].set_visible(False)
Here's a quite compact solution. Only tricky part is the ordinal number, if you really want to have that. I copied the lambda solution from Ordinal numbers replacement
Give this a try and please mark the answer with Up-button if you like it.
import matplotlib.pyplot as plt

# n//10 (integer division) is needed in Python 3; with n/10 the teens come out wrong (e.g. '11st')
ordinal = lambda n: "{}{}".format(n,"tsnrhtdd"[(n//10%10!=1)*(n%10<4)*n%10::4])

for i, a in enumerate(df['Age Group'].unique()):
    plt.figure(i)
    dfa = df.loc[df['Age Group'] == a].copy()
    dfa['Info'] = dfa.Name + '\n' + dfa.Hometown + '\n' + \
        [ordinal(row) for row in dfa['Times Ran The Race']] + ' Time'
    plt.barh(dfa.Info, dfa['Finish Time'])
    plt.title(f'{a} Age Group')
    plt.xlabel("Time (Minutes)")
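If the one-line lambda is hard to read, the ordinal suffix can also be computed with a plain function. This is a swapped-in, more explicit version rather than the snippet from the linked answer, and it can be checked on a few values before it is used for the labels.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    n = int(n)
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


for n in (1, 2, 3, 4, 11, 21):
    print(ordinal(n))
```

The int() call is there so that NumPy integers coming out of a dataframe column are handled the same way as plain Python integers.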
I'm a student researcher who's running simulations on exoplanets to determine if they might be viable for life. The software I'm using outputs a file with several columns of various types of data. So far, I've written a Python script that goes through one file and grabs two columns of data: in this case, time and the global temperature of the planet.
What I want to do is:
Write a python script that goes through multiple files, and grabs the same two columns that my current script does.
Then, I want to create subplots of all these files
What stays consistent across all of the files is time: the x axis will always be time (from 0 to 1 million years). The y axis values will change across simulations, though.
This is what I got so far for my code:
import math as m
import matplotlib.pylab as plt
import numpy as np
## Set datafile equal to the file I plan on using for data, and then open it
datafile = r"C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]\solarsys.Earth.forward"
file = open(datafile, "r")
# Create two empty arrays for my x and y axis of my graphs
years = [ ]
GlobalT = [ ]
# A for loop that looks in my file, and grabs only the 1st and 8th column adding them to the respective arrays
for line in file:
    data = line.split(' ')
    years.append(float(data[0]))
    GlobalT.append(float(data[7]))
# Close the file
file.close()
# Plot my graph
fig = plt.matplotlib.pyplot.gcf()
plt.plot(years, GlobalT)
plt.title('Global Temperature of GJ 229 b over time')
fig.set_size_inches(10, 6, forward=True)
plt.figtext(0.5, 0.0002, "This shows the global temperature of GJ 229 b when it's semi-major axis is 0.929 au, \n"
" and it's actual mass relative to the sun (~8 Earth Masses)", wrap=True, horizontalalignment='center', fontsize=12)
plt.xlabel(" Years ")
plt.ylabel("Global Temp")
plt.show()
I think the simplest thing to do is to turn your code for one file into a function, and then call it in a loop that iterates over the files.
from pathlib import Path

import pandas as pd

def parse_datafile(pth):
    """Parse one data file into a list of row dicts (file name, year, temperature)."""
    results = []
    with pth.open('r') as f:
        for line in f:
            data = line.split(' ')
            results.append({'f': pth.stem,
                            'y': data[0],
                            't': data[7]})
    return results

basedir = Path(r'C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]')

# assuming you want to parse all files in directory
# if not, can change glob string for files you want
# the per-file lists are flattened so that each dict becomes one dataframe row
all_results = [row
               for pth in basedir.glob('*') if pth.is_file()
               for row in parse_datafile(pth)]
df = pd.DataFrame(all_results)
df['y'] = pd.to_numeric(df['y'], errors='coerce')
df['t'] = pd.to_numeric(df['t'], errors='coerce')
This will give you a dataframe with three columns - f (the filename), y (the year), and t (the temperature). You then have to convert y and t to numeric dtypes. This will be faster and handle errors more gracefully than your code, which will raise an error with any malformed data.
You can further manipulate this as needed to generate your plots. Definitely check if there are any NaN values and address them accordingly, either by dropping those rows or using fillna.
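For the subplot step the question asks about, here is a minimal sketch that does not depend on the dataframe: it reads the two columns of each file into plain lists and draws one subplot per file with a shared x axis. The names read_columns and plot_series, the run_a/run_b demo files, and the use of split() (any whitespace) are assumptions made for illustration.

```python
from pathlib import Path
import tempfile


def read_columns(pth, x_col=0, y_col=7):
    """Return two lists of floats taken from the given columns of one file."""
    xs, ys = [], []
    with pth.open('r') as f:
        for line in f:
            data = line.split()
            if len(data) <= max(x_col, y_col):
                continue  # skip blank or short lines
            xs.append(float(data[x_col]))
            ys.append(float(data[y_col]))
    return xs, ys


def plot_series(series):
    """Draw one subplot per file; series maps a name to (years, temps)."""
    import matplotlib.pyplot as plt  # imported here so parsing works without it

    fig, axes = plt.subplots(len(series), 1, sharex=True, squeeze=False)
    for ax, (name, (years, temps)) in zip(axes[:, 0], series.items()):
        ax.plot(years, temps)
        ax.set_title(name)
        ax.set_ylabel('Global Temp')
    axes[-1, 0].set_xlabel('Years')
    fig.tight_layout()
    return fig


# Demo with two tiny made-up files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    basedir = Path(tmp)
    (basedir / 'run_a.forward').write_text('0 1 2 3 4 5 6 280.5\n1000 1 2 3 4 5 6 281.0\n')
    (basedir / 'run_b.forward').write_text('0 1 2 3 4 5 6 250.0\n1000 1 2 3 4 5 6 249.5\n')
    series = {pth.stem: read_columns(pth) for pth in sorted(basedir.glob('*.forward'))}

print(series)
# plot_series(series) followed by plt.show() would then display the subplots
```

Because the x axis is shared, the time range only needs to be labelled once at the bottom, while each subplot keeps its own y scale.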
I am trying to filter a CSV file: I need to store the prices of different commodities that are > 1000 in separate arrays. I can get the values for one commodity correctly, but the arrays for the other commodities are just duplicates of the first one.
CSV file looks like below figure:
CODE
import matplotlib.pyplot as plt
import csv
import pandas as pd
import numpy as np

# csv file name
filename = "CommodityPrice.csv"

# List gold price above 1000
gold_price_above_1000 = []
palladiun_price_above_1000 = []
gold_futr_price_above_1000 = []
cocoa_future_price_above_1000 = []

df = pd.read_csv(filename)
commodity = df["Commodity"]
price = df['Price']

for gold_price in price:
    if (gold_price <= 1000):
        break
    else:
        for gold in commodity:
            if ('Gold' == gold):
                gold_price_above_1000.append(gold_price)
                break

for palladiun_price in price:
    if (palladiun_price <= 1000):
        break
    else:
        for palladiun in commodity:
            if ('Palladiun' == palladiun):
                palladiun_price_above_1000.append(palladiun_price)
                break

for gold_futr_price in price:
    if (gold_futr_price <= 1000):
        break
    else:
        for gold_futr in commodity:
            if ('Gold Futr' == gold_futr):
                gold_futr_price_above_1000.append(gold_futr_price)
                break

for cocoa_future_price in price:
    if (cocoa_future_price <= 1000):
        break
    else:
        for cocoa_future in commodity:
            if ('Cocoa Future' == cocoa_future):
                cocoa_future_price_above_1000.append(cocoa_future_price)
                break

print(gold_price_above_1000)
print(palladiun_price_above_1000)
print(gold_futr_price_above_1000)
print(cocoa_future_price_above_1000)

plt.ylim(1000, 3000)
plt.plot(gold_price_above_1000)
plt.plot(palladiun_price_above_1000)
plt.plot(gold_futr_price_above_1000)
plt.plot(cocoa_future_price_above_1000)
plt.title('Commodity Price(>=1000)')
y = np.array(gold_price_above_1000)
plt.ylabel("Price")
plt.show()
print("SUCCESS")
Here is my question in detail:
Please use pandas and matplotlib to process the data in the csv and plot the processed data as charts. The expected output is shown in the following figures.
Figure 1, upper chart: take all the products with Price >= 1000 in the csv and plot their prices in April and May as a line graph. The year must be removed from the dates in the output. Each line must have its label displayed. The chart title and the x-axis and y-axis titles must be set. The y-axis range is 1000 to 3000, and the line colors are not specified.
Figure 1, lower chart: take all the products with Price >= 1000 in the csv and plot their Change % in April and May as a line graph with point markers. The markers must use a style other than '.' and 'o', and the lines must use a style other than solid. The year must be removed from the dates in the output. Each line must have its label displayed. The chart title and the x-axis and y-axis titles must be set. Grid lines must be added, the y-axis range is -15 to +15, and the line colors are not specified.
Figure 2, upper and lower charts: the condition changes to 1000 > Price >= 500. The other requirements are basically the same as in Figure 1, except that the markers and lines in the lower chart of Figure 2 must use different styles from those in Figure 1.
The upper and lower charts of Figure 1 must be displayed in the same window, and likewise for Figure 2.
All of your blocks of code are doing the exact same thing. Changing the name of the iterator doesn't change what it does.
for gold_price in price:
for palladiun_price in price:
for gold_futr_price in price:
for cocoa_future_price in price:
This is going through the exact same data. You haven't subsetted for specific commodities.
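Using the break statement in that loop doesn't make sense either: it ends the whole loop at the first price that is not above 1000. It should be a pass, which only skips that price.
Basically, for every number above 1000, you iterate through your entire Commodity column and add that number to the list whenever you see a specific commodity, regardless of which row the number came from.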
Read how to index and select data in pandas.
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
gold_price_above_1000 = df[(df.Commodity=='Gold') & (df.Price>1000)]
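The same idea, one filter per commodity instead of one shared loop over all prices, can also be sketched without pandas. The column names Commodity and Price come from the question; the sample rows and the helper name prices_above are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the shape the question describes
SAMPLE = """Commodity,Price
Gold,1700.5
Palladiun,2100.0
Gold,990.0
Cocoa Future,2400.0
Gold,1710.0
"""


def prices_above(csv_file, threshold):
    """Group prices greater than threshold by commodity name."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        price = float(row['Price'])
        if price > threshold:
            # the price is stored under its own commodity only
            grouped[row['Commodity']].append(price)
    return dict(grouped)


above_1000 = prices_above(io.StringIO(SAMPLE), 1000)
print(above_1000.get('Gold', []))
print(above_1000.get('Palladiun', []))
```

The key point is that the commodity name and the price are read from the same row, so a price can never end up in another commodity's list.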
I have this current code below. The way the code is supposed to work is ask for a user name and then print all the projects and hours associated to that user within the called csv and then plot a vertical bar graph with the x-values being the project names and the y values being the hours for that project. I'm running into an error which is also shown below. How do I pull the right values from my code and plot them accordingly? I'm very novice with python so any help you can lend would be great.
Thanks in advance
import csv
import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np
import matplotlib.pyplot as plt

with open('Report1.csv') as csvfile:
    hour_summation = {}
    read_csv = csv.reader(csvfile, delimiter=',')
    name_found = False
    while not name_found:
        # take user input for name
        full_name = input('Enter your full name: ').lower()
        # start search at top of file. Have to do this or each loop will
        # start at the end of the file after the first check.
        csvfile.seek(0)
        for row in read_csv:
            # check to see if row matches name
            if(' '.join((row[0], row[1]))).lower() == full_name.strip().lower():
                name_found = True
                hour_summation[row[2]] = hour_summation.get(row[2], 0) + int(float(row[3]))
        # name wasn't found, so repeat process
        if not name_found:
            # name wasn't found, go back to start of while loop
            print('Name is not in system')
        else:
            # break out of while loop. Technically not required since while
            # will exit here anyway, but clarity is nice
            break

print('This is {} full hours report:'.format(full_name))
for k, v in hour_summation.items():
    print(k + ': ' + str(v) + ' hours')

for k, v in hour_summation.items():
    x = np.array(k)
    y = np.array(v)
    y_pos = np.arange(len(y))
    plt.bar(y_pos, y, align='center', alpha=0.5)
    plt.xticks(y_pos, x)
plt.ylabel('hours')
plt.title('Projects for: ', full_name)
plt.show()
Here is the error I'm getting. It is doing everything as it should until it reaches the graph section.
Enter your full name: david ayers
This is david ayers full hours report:
AD-0001 Administrative Time: 4 hours
AD-0002 Training: 24 hours
AD-0003 Personal Time: 8 hours
AD-0004 Sick Time: 0 hours
OPS-0001 General Operational Support: 61 hours
SUP-2507 NAFTA MFTS OS Support: 10 hours
ENG-2001_O Parts Maturity-Overhead: 1 hours
ENG-2006_O CHEC 2 Tool-Overhead: 19 hours
FIN-2005 BLU Lake BOM Analytics: 52 hours
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-c9c8a4aa3176> in <module>()
36 x = np.array(k)
37 y = np.array(v)
---> 38 y_pos = np.arange(len(y))
39 plt.bar(y_pos, y, align='center', alpha=0.5)
40 plt.xticks(y_pos, x)
TypeError: len() of unsized object
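Judging from the traceback, a likely cause is that np.array(v) wraps a single number, which gives a zero-dimensional array, and len() is not defined for that. A minimal sketch of one way around it, assuming the goal is one bar per project: take all keys and all values of the dictionary at once instead of building arrays inside the loop. The dictionary below copies three entries from the report above; the name bar_chart_data is hypothetical.

```python
def bar_chart_data(hour_summation):
    """Return (positions, labels, heights) for a bar chart, one bar per project."""
    labels = list(hour_summation.keys())
    heights = list(hour_summation.values())
    positions = list(range(len(labels)))
    return positions, labels, heights


hour_summation = {
    'AD-0001 Administrative Time': 4,
    'AD-0002 Training': 24,
    'AD-0003 Personal Time': 8,
}
positions, labels, heights = bar_chart_data(hour_summation)
print(positions, heights)
```

These can then be passed as plt.bar(positions, heights, align='center', alpha=0.5) and plt.xticks(positions, labels). The title would also need to be a single string, for example plt.title('Projects for: ' + full_name), since the second positional argument of plt.title is a font dictionary and not extra text.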