Averaging months in a year within a csv file - python

Say I have a csv file as follows:
GL000004250,1958.0833333333333,-1.4821428571428572
GL000004250,1958.1666666666667,-2.586206896551724
GL000004250,1958.25,-1.5733333333333333
GL000004250,1958.3333333333333,4.680000000000001
GL000004250,1958.4166666666667,9.944827586206895
GL000004250,1958.5,12.874193548387098
GL000004250,1958.5833333333333,12.21290322580645
GL000004250,1958.6666666666667,7.18148148148148
GL000004250,1958.75,2.187096774193549
GL000004250,1958.8333333333333,-0.9066666666666666
GL000004250,1958.9166666666667,0.3777777777777777
GL000004250,1959.0,0.43214285714285744
GL000004250,1959.0833333333333,-6.432142857142857
GL000004250,1959.1666666666667,-6.806451612903226
GL000004250,1959.25,0.6933333333333334
GL000004250,1959.3333333333333,5.780645161290322
GL000004250,1959.4166666666667,8.343333333333332
GL000004250,1959.5,10.71935483870968
GL000004250,1959.5833333333333,10.216129032258062
The second column is the year in decimal form and the third column is the data. I would like the program to find all the values from 1958 and average them, then do the same for 1959, and so on.

If you're a beginner, start with the basics. Try with a loop and a dictionary to get a better handle on Python.
import numpy as np
with open(csvfile,'r') as f:
    yearAvgs = dict()
    data = f.read().split('\n')
    for line in data:
        if line:
            year = int(float(line.split(',')[1]))
            val = float(line.split(',')[2])
            if year not in yearAvgs:
                yearAvgs[year] = []
            yearAvgs[year].append(val)
    for k, v in yearAvgs.items():
        avg = np.mean(v)
        print("Year = ",k,": Mean = ",avg)
Edit: If you're looking for a solution with pandas:
import pandas as pd
df = pd.read_csv(csvfile,names=['ID','Year','Value'])
df['Year'] = df['Year'].astype(int)
df.groupby(['Year']).mean()
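For anyone who wants to try the pandas approach without a file on disk, here is a minimal sketch. The function name `yearly_means` and the four sample rows are made up for illustration (the values are rounded stand-ins, not the question's data), and it selects the Value column explicitly so the non-numeric ID column is not averaged.

```python
import io

import pandas as pd

# Hypothetical sample rows in the same three-column layout as the question.
SAMPLE = """GL000004250,1958.0833333333333,-1.5
GL000004250,1958.5,12.5
GL000004250,1959.25,1.0
GL000004250,1959.5,11.0
"""


def yearly_means(csv_source):
    """Return the mean of the Value column for each whole year."""
    df = pd.read_csv(csv_source, names=["ID", "Year", "Value"])
    # Truncate the decimal year (1958.5 -> 1958) so rows group by calendar year.
    df["Year"] = df["Year"].astype(int)
    return df.groupby("Year")["Value"].mean()


means = yearly_means(io.StringIO(SAMPLE))
print(means)
```

Passing a file path instead of the `io.StringIO` object should work the same way, since `pd.read_csv` accepts both.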

Related

How would I split a list of tuples into chunks

I have a list that looks something like this:
weather_history=((year,month,day),precip,tmin,tmax)
I need to split it into one-year chunks, where each chunk is a list with one year's worth of data.
Please help!
all_years_data: List[Weather_List] = []
for line,p,n,x in weather_history:
    year=line[0]
    day=line[2]
    month=line[1]
    precip=p
    tmin=n
    tmax=x
    if year not in all_years_data:
        all_years_data.append(year)
This is my code so far. I've tried many different things to get each year's worth of data into one list, but I can't figure it out.
How about this?
A = [((1,2,3),4,5,6), ((10,20,30),40,50,60), ((100,200,300),400,500,600)]
B = [i[0][0] for i in A]
If your data is like this:
weather_history = ((2020,6,12),30,10,40)
you can index weather_history directly, without a for statement (the date tuple is ordered year, month, day):
all_years_data = []
year = weather_history[0][0]
month = weather_history[0][1]
day = weather_history[0][2]
precip = weather_history[1]
tmin = weather_history[2]
tmax = weather_history[3]
if year not in all_years_data:
    all_years_data.append(year)
But if your data is like this:
weather_history = [((2020,6,12),30,10,40),((2021,6,12),30,10,40),((2022,6,12),30,10,40)]
you should loop over weather_history with a for statement:
all_years_data = []
for line in weather_history:
    year = line[0][0]
    month = line[0][1]
    day = line[0][2]
    precip = line[1]
    tmin = line[2]
    tmax = line[3]
    if year not in all_years_data:
        all_years_data.append(year)
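The loops above collect the distinct years, not the records themselves. A minimal sketch of the actual split into one-year chunks could look like this, assuming each record has the ((year, month, day), precip, tmin, tmax) layout from the question. The helper name `split_by_year` and the sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records: ((year, month, day), precip, tmin, tmax)
weather_history = [
    ((2020, 6, 12), 30, 10, 40),
    ((2020, 6, 13), 0, 12, 38),
    ((2021, 6, 12), 5, 9, 35),
]


def split_by_year(records):
    """Group records by the year in their date tuple, one list per year."""
    by_year = defaultdict(list)
    for record in records:
        year = record[0][0]  # first element of the (year, month, day) tuple
        by_year[year].append(record)
    # Return the chunks in ascending year order.
    return [by_year[year] for year in sorted(by_year)]


all_years_data = split_by_year(weather_history)
print(all_years_data)
```

A dictionary keyed by year keeps this to a single pass over the data. If the years are needed as well, returning `by_year` itself instead of the list of chunks is a reasonable alternative.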

Line Plot based on a Pandas DataFrame

I'm trying to learn how to analyze data in Python, so I'm using a dataset that I've already done some work on in Power BI, and now I'm trying to reproduce the same plots in Python.
The Pandas dataframe is this...
And I'm trying to build a line plot like this one...
This line represents the number of 'Água e sabonete' and 'Fricção com álcool' entries in the column Ação, divided by the total number of Ação entries.
This is how I managed to do it in Power BI using DAX:
Adesão = VAR nReal = (COUNTROWS(FILTER(Tabela1,Tabela1[Ação]="Água e sabonete")) + COUNTROWS(FILTER(Tabela1,Tabela1[Ação]="Fricção com álcool")))
//VAR acao = COUNTA(Tabela1[Ação]
RETURN
DIVIDE(nReal,COUNTA(Tabela1[Ação]))
I want to know if it is possible to do something similar to build the plot, or if there is another way to build it in Python.
I haven't tried anything specific yet. I think it should be possible to build it with a function, but creating one is too difficult for me right now since I'm a beginner.
Any help would be greatly appreciated!
The idea here is to get access to each month and count every time Água e sabonete and Fricção com álcool appear.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Assumes df is the question's DataFrame, indexed so that df.loc["2022-01"]
# returns that month's rows (e.g. a DatetimeIndex).

def dateprod(year):  # could be replaced by an array of month strings
    year = str(year) + "-"
    dates = []
    for i in range(1, 13):
        if i >= 10:
            date = year + str(i)
        else:
            date = year + "0" + str(i)
        dates.append(date)
    return dates

def y_construct(year, search_1, search_2):
    y_vals = np.zeros(0)
    for j in dateprod(year):
        score = 0
        temp_list = df.loc[str(j)]["aloc"].values  # all values for this month; adjust the column name to your data
        for k in temp_list:
            # Check whether the value is Água e sabonete or Fricção com álcool
            if k == str(search_1) or k == str(search_2):
                score += 1
        y_vals = np.append(y_vals, score)
    return y_vals / len(df)  # divide by the total, assuming one Ação value per row

# this does the plotting
y_vals = y_construct(2022, "Água e sabonete", "Fricção com álcool")
x_label = ["jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"]
plt.ylabel("%")
plt.plot(x_label, y_vals)
plt.show()
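The same ratio can probably be computed without explicit loops. The sketch below is one way to do it under some assumptions: the date column is called `date` (the real name is not shown in the question), the sample rows are invented, and 'Não realizado' is a made-up third value for Ação. Unlike the function above, it divides by each month's own row count, which is closer to how the DAX measure behaves on a monthly axis.

```python
import pandas as pd

# Hypothetical data; only the column name Ação comes from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-01-20", "2022-02-03", "2022-02-10"]),
    "Ação": ["Água e sabonete", "Não realizado", "Fricção com álcool", "Água e sabonete"],
})


def monthly_adherence(frame, date_col="date", action_col="Ação"):
    """Share of rows per month whose action is one of the two hand-hygiene values."""
    hits = frame[action_col].isin(["Água e sabonete", "Fricção com álcool"])
    # The mean of a boolean Series is the fraction of True values.
    return hits.groupby(frame[date_col].dt.to_period("M")).mean()


adesao = monthly_adherence(df)
print(adesao)
```

Calling `adesao.plot()` afterwards should draw the line plot, provided matplotlib is installed.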

How to return result based on a string found on a list?

I'm trying to return all rows from my Excel sheet where the column TOURNAMENT contains the string FIFA. I keep getting no results back and am not sure how to fix this. Below is a sample of data from my Excel file. Any insight would be helpful, thank you.
My excel:
import pandas as pd
import numpy as np
filename = ("results.csv")
df = pd.read_csv(filename)
#convert to datetime format
df['date'] = pd.to_datetime(df['date'], format='%Y/%M/%D')
#Which country has scored the most goals in FIFA events (qualifiers, cups, etc.) since 2010?
#To get the most goals by sum
df['total_score'] = df['home_score'] + df['away_score']
#Not sure how to check all data with the string "FIFA" in the column "Tournament"
sub_df = df[(df['date'].dt.year >= 2010)]
if "FIFA" in df['tournament']:
    sub_df2 = sub_df[sub_df['total_score'] == sub_df['total_score'].max()]
    print(sub_df2)
else:
    print("no results")
You can use Series.str.contains to check if a substring exists in the value, then use the masking to get only such occurrences:
>>> df[df['tournament'].str.contains('FIFA')]
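To show how the mask fits into the rest of the question's code, here is a small sketch with invented rows. The column names follow the question's code, while the sample values are hypothetical. It also illustrates why the original `if "FIFA" in df['tournament']` never matches: `in` on a Series tests the index labels, not the values.

```python
import pandas as pd

# Hypothetical rows using the column names from the question's code.
df = pd.DataFrame({
    "date": pd.to_datetime(["2009-06-01", "2014-06-12", "2018-06-14", "2019-03-20"]),
    "tournament": ["FIFA World Cup qualification", "FIFA World Cup",
                   "FIFA World Cup", "Friendly"],
    "home_score": [1, 3, 5, 2],
    "away_score": [0, 1, 0, 2],
})
df["total_score"] = df["home_score"] + df["away_score"]

# `in` looks at the index (0, 1, 2, 3 here), so this is False.
found_with_in = "FIFA" in df["tournament"]

# Boolean masks: tournament contains "FIFA" and the match is from 2010 onwards.
is_fifa = df["tournament"].str.contains("FIFA", na=False)
since_2010 = df["date"].dt.year >= 2010
fifa_since_2010 = df[is_fifa & since_2010]

# Rows with the highest combined score among the filtered matches.
top = fifa_since_2010[fifa_since_2010["total_score"] == fifa_since_2010["total_score"].max()]
print(top)
```

Passing `na=False` treats missing tournament names as non-matches, so the mask stays purely boolean.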

Reading a csv file and counting a row depending on another row

I have a CSV file where I need to read different columns and sum their numbers depending on another column in the dataset.
The question is:
How do the flight phases (e.g. take off, cruise, landing) contribute to fatalities?
I have to sum up column number 23 for each distinct value in column 28.
I have a solution with masks and a lot of IF statements:
import pandas as pd
database = pd.read_csv('Aviation.csv',quotechar='"',skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
data = database.to_numpy()  # originally database.as_matrix(), which was removed in pandas 1.0
TOcounter = 0
for r in data:
    if r[28] == "TAKEOFF":
        TOcounter += r[23]
print(TOcounter)
This example shows the general idea of my solution: I would have to add an if statement and a counter for every distinct value in column 28.
But I was wondering if there is a smarter solution to the issue.
The raw data can be found at: https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv
It sounds like what you are trying to achieve is
df.groupby('Broad.Phase.of.Flight')['Total.Fatal.Injuries'].sum()
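A small self-contained sketch of that groupby, using the two column names from the line above and a handful of invented rows rather than the real dataset:

```python
import pandas as pd

# Hypothetical rows; only the two column names come from the answer above.
df = pd.DataFrame({
    "Broad.Phase.of.Flight": ["TAKEOFF", "CRUISE", "TAKEOFF", "LANDING", None],
    "Total.Fatal.Injuries": [2.0, 1.0, None, 0.0, 4.0],
})

fatalities = (
    df.fillna({"Total.Fatal.Injuries": 0})   # treat missing counts as zero
      .groupby("Broad.Phase.of.Flight")["Total.Fatal.Injuries"]
      .sum()
      .sort_values(ascending=False)
)
print(fatalities)
```

Rows with a missing flight phase are dropped by `groupby` by default, so the last sample row does not appear in the result. This replaces the per-phase counters and if statements with a single grouped sum.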
This is a quick solution that does not check for errors, such as whether a string can be converted to a float. You should also consider looking up the right column by name instead of relying on the column index (like 23 and 28),
but this should work:
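import csv
import io
import urllib.request
import collections
url = 'https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv'
# Python 3: urllib2 became urllib.request, and the response yields bytes,
# so wrap it to decode the text (latin1, as in the question's read_csv call)
response = urllib.request.urlopen(url)
df = csv.reader(io.TextIOWrapper(response, encoding='latin1'))
d = collections.defaultdict(list)
for i, row in enumerate(df):
    key = row[28]
    # skip the header row and rows without a flight phase
    if key == "" or i == 0:
        continue
    val = 0 if row[23] == "" else float(row[23])
    d[key].append(val)
# sum the collected values per flight phase
d2 = {k: sum(v) for k, v in d.items()}
for k, v in d2.items():
    print("{}:{}".format(k, v))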
Result:
TAXI:110.0
STANDING:193.0
MANEUVERING:6430.0
DESCENT:1225.0
UNKNOWN:919.0
TAKEOFF:5267.0
LANDING:592.0
OTHER:107.0
CRUISE:6737.0
GO-AROUND:783.0
CLIMB:1906.0
APPROACH:4493.0

parsing CSV in pandas

I want to calculate the average number of successful Rattatas catches hourly for this whole dataset. I am looking for an efficient way to do this by utilizing pandas--I'm new to Python and pandas.
You don't need any loops. Try this. I think the logic is rather clear.
import pandas as pd
#read csv
df = pd.read_csv('pkmn.csv', header=0)
#we need to apply some transformations to extract the date from the timestamp
df['time'] = df['time'].apply(lambda x : pd.to_datetime(str(x)))
df['date'] = df['time'].dt.date
#main transformations
df = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')
result = pd.DataFrame()
result['caught total'] = df['hour'].count()
result['days'] = df['date'].nunique()
result['caught average'] = result['caught total'] / result['days']
If you have your pandas DataFrame saved as df, this should work:
rats = df.loc[df.Pokemon == "rattata"] # subset of rows relating to Rattata
total = sum(rats.Caught) # total number caught
diff = rats.time.iloc[-1] - rats.time.iloc[0] # difference between last and first; positional indexing, since rats keeps df's row labels
average = total/diff # number caught per unit of time
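Since the dataset itself is not shown, here is a hedged sketch of the first answer's idea on invented rows. The column names `time`, `Pokemon` and `caught` follow that answer, the `hour` column is derived from the timestamp here (an assumption, in case the file does not already have one), and all sample values are hypothetical.

```python
import pandas as pd

# Hypothetical catch log; column names follow the first answer above.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-07-20 09:05", "2016-07-20 09:40", "2016-07-21 09:15",
        "2016-07-21 17:30", "2016-07-21 17:45",
    ]),
    "Pokemon": ["rattata", "rattata", "rattata", "pidgey", "rattata"],
    "caught": [True, True, True, True, False],
})
df["hour"] = df["time"].dt.hour
df["date"] = df["time"].dt.date

# Keep only successful Rattata catches.
caught = df[(df["Pokemon"] == "rattata") & df["caught"]]

# Per hour of day: number of catches and number of distinct days with a catch.
per_hour = caught.groupby("hour").agg(
    caught_total=("Pokemon", "size"),
    days=("date", "nunique"),
)
per_hour["caught_average"] = per_hour["caught_total"] / per_hour["days"]
print(per_hour)
```

Note that this averages over days on which at least one Rattata was caught in that hour. Days with zero catches are not counted, which may or may not be what you want.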
