I have a list that looks something like this:
weather_history=((year,month,day),precip,tmin,tmax)
I need to split it into one-year chunks, where each chunk is a list with one year's worth of data.
Please help!
all_years_data: List[Weather_List] = []
for line,p,n,x in weather_history:
    year=line[0]
    day=line[2]
    month=line[1]
    precip=p
    tmin=n
    tmax=x
    if year not in all_years_data:
        all_years_data.append(year)
This is my code so far. I've tried many different things to get all of each year's worth of data into one list, but I can't figure it out.
How about this?
A = [((1,2,3),4,5,6), ((10,20,30),40,50,60), ((100,200,300),400,500,600)]
B = [i[0][0] for i in A]
If your data is like this:
weather_history = ((2020,6,12),30,10,40)
you can index weather_history directly, without a for statement:
year = weather_history[0][0]
month = weather_history[0][1]
day = weather_history[0][2]
precip = weather_history[1]
tmin = weather_history[2]
tmax = weather_history[3]
if year not in all_years_data:
    all_years_data.append(year)
But if your data is like this:
weather_history = [((2020,6,12),30,10,40),((2021,6,12),30,10,40),((2022,6,12),30,10,40)]
you should loop over weather_history with a for statement:
for line in weather_history:
    year = line[0][0]
    month = line[0][1]
    day = line[0][2]
    precip = line[1]
    tmin = line[2]
    tmax = line[3]
    if year not in all_years_data:
        all_years_data.append(year)
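Note that the loops above only collect the distinct years; they do not yet put each year's records into its own list, which is what the question asks for. One possible way to do that is sketched below. It assumes weather_history is a list of ((year, month, day), precip, tmin, tmax) tuples; the sample values and the name by_year are made up for illustration.

```python
from collections import defaultdict

# Made-up sample data in the ((year, month, day), precip, tmin, tmax) layout
weather_history = [
    ((2020, 6, 12), 30, 10, 40),
    ((2020, 6, 13), 0, 12, 38),
    ((2021, 6, 12), 5, 9, 41),
]

# Map each year to the list of records that belong to it
by_year = defaultdict(list)
for record in weather_history:
    (year, month, day), precip, tmin, tmax = record
    by_year[year].append(record)

# One list per year, ordered by year
all_years_data = [by_year[year] for year in sorted(by_year)]
```

Here all_years_data ends up as a list of lists, one inner list per year. If you would rather look a year up directly, you can keep the by_year dictionary instead.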
Related
I can't work out why the dataframe "newTimeDF" I am adding to is empty at the end of the for loop:
timeZonesDF = pd.DataFrame({"timeZoneDate": ["2018-03-11", "2018-11-04"]})
newTimeDF = pd.DataFrame(columns = ["startDate", "endDate"])
for yearRow, yearData in timeZonesDF.groupby(pd.Grouper(freq="A")):
    DST_start = pd.to_datetime(yearData.iloc[0]["timeZoneDate"])
    DST_end = pd.to_datetime(yearData.iloc[-1]["timeZoneDate"])
    newTimeDF["startDate"] = DST_start
    newTimeDF["endDate"] = DST_end
    continue
Can someone please point out what I am missing? Is there something different about for-loops over a groupby?
The code you have here:
newTimeDF["startDate"] = DST_start
newTimeDF["endDate"] = DST_end
is setting the startDate column equal to DST_start for all rows and the endDate column equal to DST_end for all rows. Because your dataframe has no rows when this code runs, nothing changes in your final result.
What you could do is create a dictionary from your two values like so:
tempdic = {"startDate" : DST_start, "endDate" : DST_end}
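Then append that dictionary to your dataframe to add a row. append returns a new dataframe rather than modifying newTimeDF in place, so assign the result back:
newTimeDF = newTimeDF.append(tempdic, ignore_index=True)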
Making your code look something like this:
for yearRow, yearData in timeZonesDF.groupby(pd.Grouper(freq="A")):
    DST_start = pd.to_datetime(yearData.iloc[0]["timeZoneDate"])
    DST_end = pd.to_datetime(yearData.iloc[-1]["timeZoneDate"])
    tempdic = {"startDate" : DST_start, "endDate" : DST_end}
    newTimeDF = newTimeDF.append(tempdic, ignore_index=True)
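DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent install the loop above will not run as written. A rough alternative is to collect one dictionary per group in a plain list and build the dataframe once at the end. The sketch below assumes timeZoneDate holds real datetimes and groups on the year of that column instead of using pd.Grouper; the four sample dates and the rows list are made up for illustration.

```python
import pandas as pd

# Made-up sample: two dates per year, stored as real datetimes
timeZonesDF = pd.DataFrame({
    "timeZoneDate": pd.to_datetime(
        ["2018-03-11", "2018-11-04", "2019-03-10", "2019-11-03"]
    )
})

rows = []
# Group on the calendar year of each date
for year, yearData in timeZonesDF.groupby(timeZonesDF["timeZoneDate"].dt.year):
    rows.append({
        "startDate": yearData["timeZoneDate"].iloc[0],
        "endDate": yearData["timeZoneDate"].iloc[-1],
    })

# Build the result once instead of growing it row by row
newTimeDF = pd.DataFrame(rows, columns=["startDate", "endDate"])
```

Building the frame once also avoids copying the whole dataframe on every iteration, which is what repeated appends do.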
I have a time series that looks something like this:
fechas= pd.Series(pd.date_range(start='2015-01-01', end='2020-12-01', freq='H'))
data=pd.Series(range(len(fechas)))
df=pd.DataFrame({'Date':fechas, 'Data':data})
What I need is the sum for every day, grouped by year. What I did, and it works, is:
df['year']=pd.DatetimeIndex(df['Date']).year
df['month']=pd.DatetimeIndex(df['Date']).month
df['day']=pd.DatetimeIndex(df['Date']).day
df.groupby(['year','month','day'])['Data'].sum().reset_index()
But what I need is to have the years as columns, so that it looks something like this:
res=pd.DataFrame(columns=['dd-mm','2015','2016','2017','2018','2019','2020'])
This might be what you need:
df = pd.DataFrame({'Date':fechas, 'Data':data})
df = df.groupby(pd.DatetimeIndex(df["Date"]).date)[["Data"]].sum() #sum only the numeric column
df.index = pd.to_datetime(df.index)
df["dd-mm"] = df.index.strftime("%d-%m")
output = pd.DataFrame(index=df["dd-mm"].unique())
for yr in range(2015, 2021):
    temp = df[df.index.year==yr]
    temp = temp.set_index("dd-mm")
    output[yr] = temp["Data"]
output = output.reset_index() #if you want to have dd-mm as a column instead of the index
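The same reshaping can probably be done without the explicit loop by pivoting the daily sums. This is only a sketch: it uses a smaller made-up sample (two years of hourly rows, each with Data equal to 1, so every daily sum is 24), and the names daily, daily_df and res are chosen here for illustration.

```python
import pandas as pd

# Made-up sample: hourly timestamps over two years, one unit per hour
fechas = pd.Series(pd.date_range(start="2015-01-01", end="2016-12-31 23:00",
                                 freq=pd.Timedelta(hours=1)))
df = pd.DataFrame({"Date": fechas, "Data": 1})

# Sum per calendar day (normalize() drops the time of day)
daily = df.groupby(df["Date"].dt.normalize())["Data"].sum()

daily_df = daily.reset_index()
daily_df["year"] = daily_df["Date"].dt.year
daily_df["dd-mm"] = daily_df["Date"].dt.strftime("%d-%m")

# One row per dd-mm, one column per year
res = daily_df.pivot(index="dd-mm", columns="year", values="Data")
```

Days that do not exist in a given year, such as 29-02 outside leap years, come out as NaN. The rows are sorted by the dd-mm string, so by day first and month second; call res.reset_index() if you want dd-mm as a regular column.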
Say I have a csv file as follows:
GL000004250,1958.0833333333333,-1.4821428571428572
GL000004250,1958.1666666666667,-2.586206896551724
GL000004250,1958.25,-1.5733333333333333
GL000004250,1958.3333333333333,4.680000000000001
GL000004250,1958.4166666666667,9.944827586206895
GL000004250,1958.5,12.874193548387098
GL000004250,1958.5833333333333,12.21290322580645
GL000004250,1958.6666666666667,7.18148148148148
GL000004250,1958.75,2.187096774193549
GL000004250,1958.8333333333333,-0.9066666666666666
GL000004250,1958.9166666666667,0.3777777777777777
GL000004250,1959.0,0.43214285714285744
GL000004250,1959.0833333333333,-6.432142857142857
GL000004250,1959.1666666666667,-6.806451612903226
GL000004250,1959.25,0.6933333333333334
GL000004250,1959.3333333333333,5.780645161290322
GL000004250,1959.4166666666667,8.343333333333332
GL000004250,1959.5,10.71935483870968
GL000004250,1959.5833333333333,10.216129032258062
Where the second column is the year in decimal form and the third column is the data. I would like the program to find all the values from 1958 and average them, then 1959 and average them, etc.
If you're a beginner, start with the basics. Try with a loop and a dictionary to get a better handle on Python.
import numpy as np
with open(csvfile,'r') as f:
    yearAvgs = dict()
    data = f.read().split('\n')
    for line in data:
        if line:
            year = int(float(line.split(',')[1]))
            val = float(line.split(',')[2])
            if year not in yearAvgs:
                yearAvgs[year] = []
            yearAvgs[year].append(val)
for k, v in yearAvgs.items():
    avg = np.mean(v)
    print ("Year = ",k,": Mean = ",avg)
Edit: If you're looking for a solution with pandas:
import pandas as pd
df = pd.read_csv(csvfile,names=['ID','Year','Value'])
df['Year'] = df['Year'].astype(int)
df.groupby('Year')['Value'].mean() # select Value so the string ID column is not averaged
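One thing to double-check with both versions: truncating the decimal year puts 1959.0 into 1959. The sample starts at 1958.0833 and advances by 1/12, so it may be that each value means year + month/12, in which case 1959.0 would be December 1958. That encoding is a guess and is not confirmed anywhere in the question. Under that assumption, a standard-library sketch could look like this (split_decimal_year and yearly_means are hypothetical helper names, and the three sample rows are made up):

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def split_decimal_year(value):
    """ASSUMPTION: value == year + month/12, so 1959.0 is December 1958."""
    months = round(float(value) * 12)
    return (months - 1) // 12, (months - 1) % 12 + 1

def yearly_means(lines):
    """Average the third column per calendar year."""
    by_year = defaultdict(list)
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        year, _month = split_decimal_year(row[1])
        by_year[year].append(float(row[2]))
    return {year: mean(values) for year, values in by_year.items()}

# Made-up rows in the same three-column layout
sample = io.StringIO(
    "GL000004250,1958.9166666666667,0.5\n"
    "GL000004250,1959.0,1.5\n"
    "GL000004250,1959.0833333333333,-6.0\n"
)
averages = yearly_means(sample)
```

If the file really uses plain truncation semantics, the int(float(...)) approach above is the right one and this helper is unnecessary.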
I have a data set of rainfall in half-hour intervals. I want to sum up the rainfall for each day and keep track of how many data points are summed per day to account for data gaps. Then I want to create a new file with a column for the date, a column for the rainfall, and a column for how many data points were available to sum for each day.
dailysum is my function that is trying to do this; get_data is my function for extracting the data.
def get_data(avrains):
    print('opening{}'.format(avrains))
    with open(avrains, 'r') as rfile:
        header = rfile.readline()
        dates = []
        rainfalls = []
        for line in rfile:
            line = (line.strip())
            row = line.split(',')
            d = datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
            r = row[-1]
            dates.append(d)
            rainfalls.append(float(r))
        data = zip(dates, rainfalls)
        data = sorted(data)
        return (data)
def dailysum(rains):
    day_date = []
    rain_sum = []
    for i in rains:
        dayi = i[0]
        rainsi = i[1]
        for i in dayi:
            try:
                if dayi[i]== dayi[i+1]:
                    s= rains[i]+rains[i+1]
                    rain_sum.append(float(s))
            except:
                pass
            day_date.append(dayi[i])
There are many ways to solve this, but I'll try to stay as close to your existing code as I can:
from datetime import datetime

def get_data(avrains):
    """
    Opens the file specified in avrains and returns a dictionary
    keyed by date, containing a 2-tuple of the total rainfall and
    the count of data points, like so:
    {
        date(2018, 11, 1) : (0.25, 6),
        date(2018, 11, 2) : (0.00, 5),
    }
    """
    print('opening{}'.format(avrains))
    rainfalls = dict()
    with open(avrains, 'r') as rfile:
        header = rfile.readline()  # skip the header row
        for line in rfile:
            line = line.strip()
            row = line.split(',')
            # keep only the date part so all readings from one day share a key
            d = datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S').date()
            r = float(row[-1])
            try:
                daily_rainfall, daily_count = rainfalls[d]
                daily_rainfall += r
                daily_count += 1
                rainfalls[d] = (daily_rainfall, daily_count)
            except KeyError:
                # if we don't find that date in rainfalls, add it
                rainfalls[d] = (r, 1)
    return rainfalls
Now when you call get_data("/path/to/file"), you'll get back a dictionary. You can print the values with something like this:
foo = get_data("/path/to/file")
for (measure_date, (rainfall, observations)) in foo.items():
    print(measure_date, rainfall, observations)
(I will leave the formatting of the date, and any sorting or file-writing as an exercise :) )
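For anyone who wants a starting point for that exercise, here is one possible sketch using the csv module. It assumes the dictionary has the shape described in the docstring (date keys, (total, count) values); write_daily_totals, the column names and the sample totals are all made up for illustration.

```python
import csv
import io
from datetime import date

def write_daily_totals(totals, out_file):
    """Write one row per day, sorted by date: date, rainfall, observations."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["date", "rainfall", "observations"])
    for day in sorted(totals):
        rainfall, observations = totals[day]
        writer.writerow([day.isoformat(), rainfall, observations])

# Made-up totals in the {date: (rainfall, count)} shape
totals = {date(2018, 11, 2): (0.0, 5), date(2018, 11, 1): (0.25, 6)}

# An in-memory buffer stands in for a real file here
buffer = io.StringIO()
write_daily_totals(totals, buffer)
output_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with open(path, 'w', newline='') in place of the buffer.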
Hello, I am trying to take a CSV file and iterate over each customer's data. To explain, each customer has data for 12 months. I want to analyze their yearly data, save the correlations of this data to a new list, and loop until all customers have been analyzed.
For instance, here is what a customer's data might look like (simplified case):
I have been able to get this to work to generate correlations from a CSV of one customer's data. However, there are thousands of customers in my datasheet. I want to use a nested for loop to get all of the correlation values for each customer into a list/array. The list would have a row of a specific customer's correlations, then the next row would be the next customer.
Here is my current code:
import numpy
from numpy import genfromtxt
overalldata = genfromtxt('C:\Users\User V\Desktop\CUSTDATA.csv', delimiter=',')
emptylist = []
overalldatasubtract = overalldata[13::]
#This is where I try to use the for loop to go through all the customers. I don't know if len will give me all the rows or the number of columns.
for x in range(0,len(overalldata),11):
for x in range(0,13,1):
cust_months = overalldata[0:x,1]
cust_balancenormal = overalldata[0:x,16]
cust_demo_one = overalldata[0:x,2]
cust_demo_two = overalldata[0:x,3]
num_acct_A = overalldata[0:x,4]
num_acct_B = overalldata[0:x,5]
#Correlation Calculations
demo_one_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_one)[1,0]
demo_two_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_two)[1,0]
demo_one_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_one)[1,0]
demo_one_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_one)[1,0]
demo_two_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_two)[1,0]
demo_two_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_two)[1,0]
result_correlation = [demo_one_corr_balance, demo_two_corr_balance, demo_one_corr_acct_a, demo_one_corr_acct_b, demo_two_corr_acct_a, demo_two_corr_acct_b]
result_correlation_combined = emptylist.append(result_correlation)
#This is where I try to delete the rows I have already analyzed.
overalldata = overalldata[11**x::]
print result_correlation_combined
print overalldatasubtract
It seemed that my subtraction method was working, but when I tried it with my larger data set, I realized my method is totally wrong.
Would you do this a different way? I think that it can work, but I cannot find my mistake.
You use the same variable x for both loops. In the second loop x goes from 0 to 12 regardless of the customer, and since the row number is set only by x, you're stuck on the first customer.
Your double loop should rather look like this:
# loop over the customers
for x_customer in range(0,len(overalldata),12):
    # loop over the months
    for x_month in range(0,12,1):
        # row number: x (x_customer already advances in steps of 12)
        x = x_customer + x_month
        ...
I changed the bounds and steps of the loops because:
loop 1: there are 12 months so 12 lines per customer -> step = 12
loop 2: there are 12 months, so month number ranges from 0 to 11 -> range(0,12,1)
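As a rough illustration of the "12 rows per customer" idea, each customer's block can also be sliced out in one step instead of looping over months. On a 2D numpy array, len gives the number of rows, so range(0, len(data), 12) yields the first row of each customer. The sketch below assumes every customer has exactly 12 consecutive rows; the column layout, the random sample data and the name correlations_per_customer are made up for illustration and do not match the real CSV.

```python
import numpy as np

ROWS_PER_CUSTOMER = 12  # assumption: exactly 12 monthly rows per customer

def correlations_per_customer(data, column_pairs):
    """Return one row of correlation coefficients per customer."""
    results = []
    for start in range(0, len(data), ROWS_PER_CUSTOMER):
        # The 12 rows that belong to this customer
        block = data[start:start + ROWS_PER_CUSTOMER]
        results.append([
            np.corrcoef(block[:, a], block[:, b])[1, 0]
            for a, b in column_pairs
        ])
    return np.array(results)

# Made-up data for two customers; columns: month, demo, balance
rng = np.random.default_rng(0)
months = np.tile(np.arange(1, 13), 2).astype(float)
demo = rng.normal(size=24)
balance = np.concatenate([2 * demo[:12] + 1, -3 * demo[12:]])
data = np.column_stack([months, demo, balance])

# Correlate the balance column (2) with the demo column (1) per customer
corr = correlations_per_customer(data, [(2, 1)])
```

Because the result is built as a new array, there is no need to delete already-processed rows from the input while looping.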
This is how I solved the problem. It was a problem with the placement of my for loops: a simple indentation problem. Thank you to the poster above for the help.
for x_customer in range(0,len(overalldata),12):
for x in range(0,13,1):
cust_months = overalldata[0:x,1]
cust_balancenormal = overalldata[0:x,16]
cust_demo_one = overalldata[0:x,2]
cust_demo_two = overalldata[0:x,3]
num_acct_A = overalldata[0:x,4]
num_acct_B = overalldata[0:x,5]
#Correlation Calculations
demo_one_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_one)[1,0]
demo_two_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_two)[1,0]
demo_one_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_one)[1,0]
demo_one_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_one)[1,0]
demo_two_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_two)[1,0]
demo_two_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_two)[1,0]
result_correlation = [(demo_one_corr_balance),(demo_two_corr_balance),(demo_one_corr_acct_a),(demo_one_corr_acct_b),(demo_two_corr_acct_a),(demo_two_corr_acct_b)]
numpy.savetxt('correlationoutput.csv', (result_correlation))
result_correlation_combined = emptylist.append([result_correlation])
cust_delete_list = [0,(x_customer),1]
overalldata = numpy.delete(overalldata, (cust_delete_list), axis=0)