I'm new to programming, so I have a question about a DataFrame and iteration problem.
(I'm using Python.)
I have the following:
This is my DataFrame.
In the first column (x) I have the date, and in the second column (y) I have some values (the total shape is (119, 2)).
My question is:
If I want to select the date "2020-12-01", sum the 14 previous values, assign the result to that date, and then do the same for the next date, how can I do that?
(In the previous image, I marked the date in blue and the values I want to add to the blue value in red.)
I tried the following:
final_value = 0
for i in data["col_name"]:
    final_value = data["col_name"].iloc[i:14].sum()
But the output is 0.
Can someone give me some ideas to solve this problem?
Thanks for reading.
Convert the x column to datetime:
df['x'] = pd.to_datetime(df['x'], format='%Y-%m-%d')
Use rolling to sum y over a 14-day window (the window ends at, and includes, the current row):
df.rolling("14D", on="x").sum()['y']
Try this on your full dataset.
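If the sum should cover the 14 rows strictly before each date, leaving out that date's own value, shifting the column by one row before rolling is one option. This is only a sketch with made-up data, and it assumes one row per day, sorted by x; the column names sum_incl and sum_prev14 are invented for the example.

```python
import pandas as pd

# Made-up data: one row per day, y = 1, 2, 3, ...
df = pd.DataFrame({
    "x": pd.date_range("2020-11-01", periods=40, freq="D"),
    "y": range(1, 41),
})

# Current row plus the 13 rows before it
df["sum_incl"] = df["y"].rolling(14).sum()

# The 14 rows strictly before the current row (current value excluded)
df["sum_prev14"] = df["y"].shift(1).rolling(14).sum()

# Look at the result for one date
row = df.loc[df["x"] == pd.Timestamp("2020-12-01")]
print(row)
```

The first rows are NaN in both columns because there are not yet 14 values to add up.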
Related
I have the following dataframe called df1 that contains data for a number of regions in the column NUTS_ID:
The index, called Date, has all the days of 2010. That is, for each code in NUTS_ID I have every day of 2010 (all days of the year for AT1, AT2 and so on). I created a list containing the dates corresponding to non-workdays, and I want to add a column with 0 for non-workdays and 1 for workdays.
For this, I simply used a for loop that checks day by day if it's in the workday list I created:
for day in df1.index:
    if day not in workdays_list:
        df1.loc[day, 'Workday'] = 0  # Assigning 0 to non-workdays
    else:
        df1.loc[day, 'Workday'] = 1  # Assigning 1 to workdays
This works well enough if the dataset is not big, but with some of the datasets I'm processing it takes a very long time. I would like to ask for ideas to make the process faster and more efficient. Thank you in advance for your input.
EDIT: One thing I have considered is that a groupby might be helpful, but I don't know if that is correct.
You can use np.where with isin to check if your Date (i.e. your index) is in the list you created:
import numpy as np
df1['Workday'] = np.where(df1.index.isin(workdays_list),1,0)
I can't reproduce your dataset, but something along those lines should work.
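Here is a small sketch of the same idea on made-up data, since the real dataset isn't available. The two regions, the one-week date range, and the rule that Monday to Friday are workdays are all assumptions for illustration.

```python
import pandas as pd

# Made-up data: two regions, each with the same seven days as index
dates = pd.date_range("2010-01-01", "2010-01-07", freq="D")
df1 = pd.DataFrame(
    {"NUTS_ID": ["AT1"] * len(dates) + ["AT2"] * len(dates)},
    index=dates.append(dates).rename("Date"),
)

# Hypothetical workday list: Monday to Friday
workdays_list = [d for d in dates if d.weekday() < 5]

# One vectorized membership test instead of a Python loop over the index
df1["Workday"] = df1.index.isin(workdays_list).astype(int)
print(df1)
```

Because isin tests the whole index at once, repeated dates (one per region) are handled without any per-row assignment.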
From the dataframe below:
I would like to group the column 'datum' by date (01-01-2019 and so on) and, at the same time, get the average of the column 'PM10_gemiddelde'.
Right now each date such as 01-01-2019 appears 24 times, once per hour, and I need those rows combined into one with the average of 'PM10_gemiddelde'. See the picture for the data.
Besides that, PM10_gemiddelde also contains negative values. How can I remove them easily in Python?
Thank you!
P.S. I'm new to Python.
What you are trying to do can be achieved with:
data[['datum','PM10_gemiddelde']].loc[data['PM10_gemiddelde'] > 0 ].groupby(['datum']).mean()
You can create a new column with the average of PM10_gemiddelde using groupby along with transform. Try the following:
Assuming your dataframe is called df, start by removing the negative data (the .copy() avoids a SettingWithCopyWarning when a column is added to the result later):
new_df = df[df['PM10_gemiddelde'] > 0].copy()
Then, you can create a new column that contains the average value for every date:
new_df['avg_col'] = new_df.groupby('datum')['PM10_gemiddelde'].transform('mean')
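If 'datum' holds full hourly timestamps rather than plain dates, grouping on the column directly would give one group per hour. A minimal sketch of one way around that, assuming 'datum' is (or can be converted to) a datetime column; the data here is made up, and >= 0 is used so that zero readings are kept:

```python
import pandas as pd

# Made-up hourly data: two days, with one negative reading on the first day
df = pd.DataFrame({
    "datum": pd.Timestamp("2019-01-01") + pd.to_timedelta(list(range(48)), unit="h"),
    "PM10_gemiddelde": [10.0] * 24 + [20.0] * 24,
})
df.loc[3, "PM10_gemiddelde"] = -5.0

# Drop the negative readings
clean = df[df["PM10_gemiddelde"] >= 0]

# Group by the calendar date of each timestamp and average per day
daily = clean.groupby(clean["datum"].dt.date)["PM10_gemiddelde"].mean()
print(daily)
```

This returns one row per day, whereas the transform version above keeps every hourly row and repeats the daily average on each of them.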
I have a pandas DataFrame with data from an icecream freezer. Several columns describe the different temperatures in the system as well as some other things.
One column, named 'Defrost status', uses boolean values to tell me when the freezer was defrosting to remove excess ice.
Those 'defrosts' are what I am interested in, so I added another column named "around_defrost". This column currently only has NaN values, but I want to change them to 'True' whenever there is a defrost within 30 minutes of that specific row in the dataframe.
The data is recorded every minute, so 30 minutes would mean that the 30 rows before a defrost and the 30 rows after it need to be set to 'True'.
I have tried to do this with iterrows, itertuples and by playing with the indexes as seen in the figure below, but no success so far. If anyone has a good idea of how this could be done, I'd really appreciate it!
You need to use dataframe.rolling:
df = df.sort_values("Time")  # sort by Time
df['around_defrost'] = df['Defrost status'].rolling(60, center=True, min_periods=0).apply(
    lambda x: True if True in x else False, raw=True)
EDIT: you may need rolling(61, center=True) since you want to consider the row in question AND 30 before and after.
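As a rough sketch of the 61-row variant, a rolling maximum over the 0/1 values gives the same flag without a Python lambda. The data below is made up (100 one-minute rows with a single defrost at row 50), and the sketch assumes exactly one row per minute with no gaps.

```python
import pandas as pd

# Made-up data: 100 one-minute rows with a single defrost at row 50
df = pd.DataFrame({
    "Time": pd.Timestamp("2021-01-01") + pd.to_timedelta(list(range(100)), unit="min"),
    "Defrost status": [False] * 100,
})
df.loc[50, "Defrost status"] = True

df = df.sort_values("Time")

# Centered window of 61 rows: the row itself plus 30 before and 30 after.
# The maximum of the 0/1 values is 1 whenever any row in the window is a defrost.
df["around_defrost"] = (
    df["Defrost status"].astype(int)
    .rolling(61, center=True, min_periods=1)
    .max()
    .astype(bool)
)
print(df["around_defrost"].sum())
```

With min_periods=1 the rows near the start and end of the data still get a value, based on the part of the window that exists.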
I calculate the gap in quarters between two dates. Now, I want to test whether that gap is bigger than 2.
Thank you for your comments!
I'm actually running code from WRDS (Wharton Research Data Services). Below, fst_vint is a DataFrame with two date variables, rdate and lag_rdate. The first line seems to convert them to quarter variables (e.g., 9/8/2019 to 2019Q3) and then take the difference between them, storing it in a new column qtr.
fst_vint.qtr >= 2 creates a problem, because the former is a QuarterEnd object, while the latter is an integer. How do I deal with this problem?
fst_vint['qtr'] = (fst_vint['rdate'].dt.to_period('Q')-\
fst_vint['lag_rdate'].dt.to_period('Q'))
# label first_report flag
fst_vint['first_report'] = ((fst_vint.qtr.isnull()) | (fst_vint.qtr>=2))
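The comparison with 2 works once the gap is a plain number rather than an offset object. One way is to convert the column to an integer quarter count and call .diff() on it. Note that a raw .astype(int) on a datetime column gives nanoseconds, not quarters, and that .diff() subtracts consecutive rows, so this assumes lag_rdate is the previous row's rdate. The code in your case would be something like:
fst_vint['qtr'] = (fst_vint['rdate'].dt.year * 4 + fst_vint['rdate'].dt.quarter).diff()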
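If lag_rdate is not simply the previous row's rdate, the gap can instead be computed column against column. The following is only a sketch on made-up dates, not the WRDS code; quarter_index is a hypothetical helper that turns each date into a running quarter count (year * 4 + quarter), so the difference is an ordinary number that can be compared with 2.

```python
import pandas as pd

# Made-up data: the first row has no lagged date
fst_vint = pd.DataFrame({
    "rdate": pd.to_datetime(["2019-03-31", "2019-09-30", "2020-03-31"]),
    "lag_rdate": pd.to_datetime([None, "2019-03-31", "2019-12-31"]),
})

def quarter_index(dates):
    """Hypothetical helper: running quarter count, NaN where the date is missing."""
    return dates.dt.year * 4 + dates.dt.quarter

# Numeric gap in quarters between the two columns, row by row
fst_vint["qtr"] = quarter_index(fst_vint["rdate"]) - quarter_index(fst_vint["lag_rdate"])

# Label the first_report flag: no lagged date, or a gap of two or more quarters
fst_vint["first_report"] = fst_vint["qtr"].isna() | (fst_vint["qtr"] >= 2)
print(fst_vint)
```

Missing dates become NaN in qtr, and NaN >= 2 evaluates to False, so the isna() check is what flags those rows.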
I have a CSV dataset where I want to calculate the average for every row. The average should be calculated from the data starting at column 14. This is what I have done so far, but I am still not getting the average value. Can someone help me with this?
I am also getting confused by this axis thing.
import pandas as pd

file = 'dataset.csv'
df = pd.read_csv(file)
d_col = df[df.columns[14:]]
mean_value = d_col['mean'] = d_col.mean(axis=1, skipna=True, numeric_only=True)
print(mean_value)
d_col.to_csv('out.csv')
That's an unusual indexing syntax you're using. A clearer way would be:
d_col = df.iloc[:, 14:]
axis=0 takes the average down each column (one value per column), and axis=1 takes it across each row (one value per row), which is what you seem to be doing correctly. I'm not sure exactly what you mean by not getting the average: d_col should contain your original data plus a new column named "mean" holding the result.
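To make the axis behaviour concrete, here is a tiny sketch on a made-up three-column frame. The column names and values are invented for the example.

```python
import pandas as pd

# Made-up frame with two rows and three columns
df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 4.0], "c": [3.0, 5.0]})

col_means = df.mean(axis=0)  # one value per column
row_means = df.mean(axis=1)  # one value per row

# Select columns by position, then store the row means as a new column
subset = df.iloc[:, 1:].copy()
subset["mean"] = subset.mean(axis=1)

print(col_means)
print(row_means)
print(subset)
```

The .copy() makes subset an independent frame, so adding the "mean" column does not trigger a SettingWithCopyWarning.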
Because you don't provide sample data, see the following sample code. The first column is a text column that should be ignored, whereas the other columns in the DataFrame df are the ones used to calculate the mean value.
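import numpy as np
import pandas as pd

# prepare some dataset: one text column followed by 10 random numeric columns
letters = 'abcdefghijklmnopqrstuvwxyz'
rows = 10
col1 = np.array(list(letters))[np.random.permutation(len(letters))[:rows]]
df = pd.concat([pd.DataFrame(col1), pd.DataFrame(np.random.randn(rows, 10))], axis=1)
# row-wise mean over every column except the first (text) one
result = df.iloc[:, 1:].mean(axis=1)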
The result then looks like this:
0 0.693024
1 -0.356701
2 0.082385
3 -0.115622
4 -0.060414
5 0.104119
6 -0.435787
7 0.023327
8 -0.144272
9 0.363254
dtype: float64
/edit: Changed the answer above to use df.iloc instead of df[df.columns[...]], as the latter causes problems when two columns have the same name. Please mark peidaqi's answer as the correct one.
The issue was here: I was saving d_col as the output CSV file instead of mean_value. It's silly, but I guess that's how you learn to pick things up. Thanks @peidaqi and others for your explanations.