I am trying to norm data in a 2D array, so that all numerical data points become normed against a chosen column. Allow me to elaborate:
I have data from countries and years, showing the population of each country per year. So the first row is years, and the first column is countries, and all the rest of the data is population. The data starts in 2019. I want to use 2019 as a base year and then norm the following years so that they become (+ or - integers * 100) to give increases and drops.
So far my code just gives me 1.0 in that first column, since I am dividing each element by itself along the array! Only the first column after the "countries" column should have 1, since it is that column's elements divided by themselves. But the other columns should give a value n showing a delta with the 2019 values. Instead, they seemingly just give back the data that is already in the 2D list. How do I use the data from the "2019" column as the divisor, so that each year's population becomes a normed value?
I went ahead and refilled the list so that the data is the same in each row. This would mean that if the code were working correctly, all the data in the array should also be 1.0. Since it is not, there is a problem.
Here is my code so far:
list_1 = [['countries', 2019, 2020, 2025], ['aruba', 2, 2, 2], ['barbados', 2, 2, 2], ['japan', 2, 2, 2]]
for row in range(1, len(list_1)):
    for column in range(1, len(list_1[row])):
        list_1[row][column] = (list_1[row][column])/list_1[row][1]
        print(f'{list_1[row][column]:15}', end='')
    print()
Thank you for your insight! And I am not allowed to import any modules for this assignment.
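A likely culprit, as a sketch: the inner loop divides by list_1[row][1], but that very entry is overwritten with 1.0 on the loop's first pass, so every later column is divided by 1.0 and comes back unchanged. Saving the base 2019 value before the inner loop avoids this (plain Python only, since imports are not allowed):

list_1 = [['countries', 2019, 2020, 2025], ['aruba', 2, 2, 2], ['barbados', 2, 2, 2], ['japan', 2, 2, 2]]
for row in range(1, len(list_1)):
    base = list_1[row][1]  # keep the 2019 value before the loop overwrites it
    for column in range(1, len(list_1[row])):
        list_1[row][column] = list_1[row][column] / base
        print(f'{list_1[row][column]:15}', end='')
    print()

With the all-2s test data, every printed value is now 1.0, as expected.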
I have a pandas dataframe that you can see in the screenshot. The dataframe has a time resolution of 15 minutes (it is generation data). I would like to reduce this time resolution to 1 hour, meaning that I should take every 4th row, and the value in every 4th row should be the average of the last 4 rows (including this one). So it should be a rolling average with non-overlapping horizons.
I tried the following for one column (wind offshore):
df_generation = pd.read_csv("C:/Users/Desktop/Data/generation_data.csv", sep =",")
df_generation_2 = df_generation
df_generation_2['Wind Offshore Average'] = df_generation_2['Wind Offshore'].rolling(4).mean()
But this is not what I really want. As you can see in the screenshot, my code just created an additional column with the average of the last 4 entries for every timeslot. Here the rolling average has overlapping horizons. What I want is a new dataframe that only has an entry for every hour (every 4 timeslots of the original array). Do you have an idea how I can do that? I'd appreciate every comment.
From looking at your index, it looks like the .resample method is what you are looking for (the docs have many examples for specific uses): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
as in
new = df_generation['Wind Offshore'].resample('1H').mean()
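Note that .resample needs a DatetimeIndex (or a datetime column passed through its on= parameter). A minimal sketch, assuming a timestamp column named 'Timestamp' (hypothetical, adjust to your actual column name):

import pandas as pd

df_generation = pd.read_csv("C:/Users/Desktop/Data/generation_data.csv", sep=",")
df_generation['Timestamp'] = pd.to_datetime(df_generation['Timestamp'])  # hypothetical column name
new = df_generation.resample('1H', on='Timestamp')['Wind Offshore'].mean()  # one row per hour

If there is no usable timestamp at all, grouping the rows in blocks of four gives the same non-overlapping average:

import numpy as np

hourly = df_generation['Wind Offshore'].groupby(np.arange(len(df_generation)) // 4).mean()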
I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==family)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel, and they are correct. The problem is with the last line: as soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new values.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (even NumPy), unlike general-purpose Python, analysts should avoid using for loops, as there are many vectorized options for conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs put it, values broadcast to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a DataFrame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original DataFrame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping altogether, since transform works nicely to assign new DataFrame columns. Also, below, div is the Series division method (functionally equivalent to the / operator):
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                   )
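For a quick sanity check, here is a minimal runnable sketch with made-up numbers (hypothetical data, chosen to mirror the 10,000 / 100,000 example from the question):

import pandas as pd

# two FISH products sold in month 1: the family total is 100,000
testingAgain = pd.DataFrame({'Month': [1, 1],
                             'SKU': [1234, 5678],
                             'Family': ['FISH', 'FISH'],
                             'Qty': [10000.0, 90000.0]})

testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum')))

print(testingAgain)  # SKU 1234 gets 0.1 (10,000/100,000), SKU 5678 gets 0.9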
I have data arrays (361x361) for Jan, Feb, March, Apr, Oct, Nov and Dec for a given year.
So far I've been storing them in individual netcdfs for every month in the year (e.g. 03.nc, 10.nc)
I'd like to combine all months into one netcdf, so that I can do something like:
march_data = data.sel(month='03')
or alternatively data.sel(month=3)
So far I've only been able to stack the monthly data in a 361x361x7 array and it's unhelpfully indexed so that to get March data you need to do data[:,:,2] and to get October it's data[:,:,4]. Clearly 2 & 4 do not intuitively correspond to the months of March and October. This is in part because python is indexed from zero and in part because I'm missing the summer months. I could put nan fields in for the missing months, but that wouldn't solve the index-0 issue.
My attempt so far:
data = xarray.Dataset(
    data_vars={'ice_type': (['x', 'y', 'time'], year_array)},
    coords={'lon': (['x', 'y'], lon_target),
            'lat': (['x', 'y'], lat_target),
            'month_number': (['time'], month_int)})
Here year_array is a 361x361x7 numpy array, and month_int is a list that maps the third index of year_array to the month number: [1,2,3,4,10,11,12].
When I try to get Oct data with oct = data.sel(month_number=10) it throws an error.
On a side note, I'm aware that there's possibly a solution to be found here, but to be honest I don't understand how it works. My confusion is mostly based around how they use 'time' both as a dictionary key and list of times at the same time.
I think I've written a helper function to do something just like that:
import xarray as xr

def combine_new_ds_dim(ds_dict, new_dim_name):
    """
    Combines a dictionary of datasets along a new dimension using dictionary keys
    as the new coordinates.

    Parameters
    ----------
    ds_dict : dict
        Dictionary of xarray Datasets or DataArrays
    new_dim_name : str
        The name of the newly created dimension

    Returns
    -------
    xarray.Dataset
        Merged Dataset or DataArray
    """
    expanded_dss = []
    for k, v in ds_dict.items():
        # add the new dimension to each dataset and label it with the dict key
        expanded_dss.append(v.expand_dims(new_dim_name))
        expanded_dss[-1][new_dim_name] = [k]
    new_ds = xr.concat(expanded_dss, new_dim_name)
    return new_ds
If you have all of the data in individual netcdfs, then you should be able to import them into individual DataArrays. Assuming you've done that, you could then do
month_das = {
    1: january_da,
    2: february_da,
    ...
    12: december_da
}
year_data = combine_new_ds_dim(month_das, 'month')
which would be the concatenation of all of the data along the new dimension month with the desired coordinates. I think the main loop of the function is easy enough to separate if you want to use that alone.
EDIT:
For anyone looking at this in the future, there's a much easier way of doing this with builtin xarray functions. You can just concatenate along a new dimension
year_data = xr.concat([january_da, february_da, ..., december_da], dim="month")
which will create a new DataArray with the constituent arrays concatenated along a new dimension, but without coordinates on that dimension. To add coordinates,
year_data["month"] = [1, 2, ..., 12]
at which point year_data will be concatenated along the new dimension "month" and will have the desired coordinates along that dimension.
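As for the sel error in the original attempt: in the Dataset as constructed in the question, month_number is a coordinate attached to the time dimension rather than a dimension itself, so data.sel(month_number=10) fails. A likely fix (a sketch, assuming that Dataset) is to swap the dimension so month_number becomes the index:

# month_number is a non-dimension coordinate on 'time'; promote it to the indexing dimension
data = data.swap_dims({'time': 'month_number'})
oct_data = data.sel(month_number=10)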
I have a csv file which contains approximately 100 columns of data. Each column represents temperature values taken every 15 minutes throughout the day for each of the 100 days. The header of each column is the date for that day. I want to convert this into two columns, the first being the date time (I will have to create this somehow), and the second being the temperatures stacked on top of each other for each day.
My attempt:
with open("original_file.csv") as ofile:
stack_vec = []
next(ofile)
for line in ofile:
columns = lineo.split(',') # get all the columns
for i in range (0,len(columns)):
stack_vec.append(columnso[i])
np.savetxt("converted.csv",stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column and add 15 minutes to the date time for each row. Any help would be greatly appreciated.
If I got this correctly, you have a csv with 96 rows and 100 columns and want to stack it into one vector, day after day, giving a vector with 9600 entries, right?
An easy approach would be to use numpy:
import numpy as np

x = np.genfromtxt('original_file.csv', delimiter=',', skip_header=1)  # skip the row of date headers
data = x.ravel(order='F')
Note that numpy is a third-party library, but it is the go-to library for math.
The first line reads the csv into an ndarray, which is like a matrix (even though it behaves differently for some mathematical operations). Then with ravel you flatten it into one vector. The order='F' argument makes it read column by column, i.e. day after day (leave it as the default if you want it time point after time point instead).
For your date problem, see the question "How can I make a python numpy arange of datetime"; I don't think I could give a better example than that.
If you have these two arrays, you can ensure the shapes with data.reshape(9600, 1) (and the same for the dates) and then stack them with np.concatenate([data, dates], axis=1), with dates being your date vector.
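A minimal sketch of the date part, assuming (hypothetically) that the 100 days are consecutive and start at midnight on 2020-01-01; adjust the start to match the first column header of your file:

import numpy as np

start = np.datetime64('2020-01-01T00:00')  # hypothetical start date
dates = start + np.arange(96 * 100) * np.timedelta64(15, 'm')  # 15-minute steps over 100 days
dates = dates.astype(str).reshape(9600, 1)  # as strings so they can sit next to the data

stacked = np.concatenate([dates, data.astype(str).reshape(9600, 1)], axis=1)
np.savetxt('converted.csv', stacked, delimiter=',', fmt='%s')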
I have a 4D array that has two spatial directions, a month column and a year column. It gives a scalar value at each spatial point and for each month. I want to reshape this array to be 3D so that instead of the value being defined as x, y, month, year, it is just defined as x, y, month, where now the month column runs from 1-36 say with no year column instead of 1-12 with a year column of 1-3. How would I do this in Python? Thanks!
The basic approach is to code the new column something like:
new_month = old_month + 12*(old_year-1)
This translates your 3-year scale into a continuum of months numbered 1-36. I can't show you how to code this, because (1) you haven't given us reference code, so I have little idea how your 4D array is structured; (2) As I hope you've read in the help documentation, we're not a coding service.
Add a new column with values of (year-1)*12 + month, then discard or ignore your year and month columns. Details depend on exactly how your data is currently structured; if it is a numpy array, this is about two lines of code!
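For what it's worth, a sketch of those two lines, assuming (hypothetically) a plain numpy array with axes ordered (x, y, month, year):

import numpy as np

data_4d = np.random.rand(361, 361, 12, 3)  # dummy stand-in for the real array

# Put the year axis before the month axis, then merge the two, so that
# entry (month m, year t) lands at index 12*t + m, i.e. the 0-based version of
# new_month = old_month + 12*(old_year - 1).
data_3d = data_4d.transpose(0, 1, 3, 2).reshape(361, 361, 36)

assert np.array_equal(data_3d[:, :, 12*2 + 4], data_4d[:, :, 4, 2])  # month 5 of year 3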