I can't work out why the dataframe "newTimeDF" I am adding to is empty at the end of the for loop:
timeZonesDF = pd.DataFrame({"timeZoneDate": ["2018-03-11", "2018-11-04"]})
newTimeDF = pd.DataFrame(columns = ["startDate", "endDate"])
for yearRow, yearData in timeZonesDF.groupby(pd.Grouper(freq="A")):
    DST_start = pd.to_datetime(yearData.iloc[0]["timeZoneDate"])
    DST_end = pd.to_datetime(yearData.iloc[-1]["timeZoneDate"])
    newTimeDF["startDate"] = DST_start
    newTimeDF["endDate"] = DST_end
    continue
Can someone please point out what I am missing? Is there something about groupby for loops that works differently?
The code you have here:
newTimeDF["startDate"] = DST_start
newTimeDF["endDate"] = DST_end
is setting the startDate column equal to DST_start for all rows and the endDate column equal to DST_end for all rows. Because your dataframe has no rows at the time this code runs, nothing is changed in your final product.
What you could do is create a dictionary from your two values like so:
tempdic = {"startDate" : DST_start, "endDate" : DST_end}
Then append that dictionary to your dataframe to add a row.
newTimeDF.append(tempdic, ignore_index=True)
Making your code look something like this
for yearRow, yearData in timeZonesDF.groupby(pd.Grouper(freq="A")):
    DST_start = pd.to_datetime(yearData.iloc[0]["timeZoneDate"])
    DST_end = pd.to_datetime(yearData.iloc[-1]["timeZoneDate"])
    tempdic = {"startDate": DST_start, "endDate": DST_end}
    newTimeDF = newTimeDF.append(tempdic, ignore_index=True)
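Worth noting for newer pandas: DataFrame.append was removed in pandas 2.0, so the pattern above no longer runs there. An equivalent sketch collects the rows in a list and builds the frame once at the end (the sample dates below are taken from the question; freq="A" is spelled "YE" in pandas 2.2+):

```python
import pandas as pd

# Sample frame modeled on the question (2018 DST transition dates)
timeZonesDF = pd.DataFrame(
    {"timeZoneDate": pd.to_datetime(["2018-03-11", "2018-11-04"])}
)

rows = []
# "A" = year-end frequency; newer pandas (>= 2.2) prefers the alias "YE"
for yearRow, yearData in timeZonesDF.groupby(pd.Grouper(key="timeZoneDate", freq="A")):
    rows.append({
        "startDate": yearData.iloc[0]["timeZoneDate"],
        "endDate": yearData.iloc[-1]["timeZoneDate"],
    })

# Build the result once instead of appending row by row
newTimeDF = pd.DataFrame(rows, columns=["startDate", "endDate"])
```

Building the frame in one constructor call is also much faster than repeated appends, since each append copies the whole frame.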
I have a time series that looks something like this:
fechas= pd.Series(pd.date_range(start='2015-01-01', end='2020-12-01', freq='H'))
data=pd.Series(range(len(fechas)))
df=pd.DataFrame({'Date':fechas, 'Data':data})
What I need to do is sum the data for every day and group by year. What I did, and it works, is:
df['year']=pd.DatetimeIndex(df['Date']).year
df['month']=pd.DatetimeIndex(df['Date']).month
df['day']=pd.DatetimeIndex(df['Date']).day
df.groupby(['year','month','day'])['Data'].sum().reset_index()
But what I need is to have the years as columns, to look something like this:
res = pd.DataFrame(columns=['dd-mm', '2015', '2016', '2017', '2018', '2019', '2020'])
This might be what you need:
df = pd.DataFrame({'Date': fechas, 'Data': data})
# select the Data column so the non-numeric Date column is not summed
df = df.groupby(pd.DatetimeIndex(df["Date"]).date)[["Data"]].sum()
df.index = pd.to_datetime(df.index)
df["dd-mm"] = df.index.strftime("%d-%m")
output = pd.DataFrame(index=df["dd-mm"].unique())
for yr in range(2015, 2021):
    temp = df[df.index.year == yr]
    temp = temp.set_index("dd-mm")
    output[yr] = temp["Data"]
output = output.reset_index()  # if you want to have dd-mm as a column instead of the index
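An alternative sketch of the same reshape using pivot instead of a year loop (this is my own variant, not from the original answer; column names follow the question):

```python
import pandas as pd

# Same sample data as the question
fechas = pd.Series(pd.date_range(start="2015-01-01", end="2020-12-01", freq="H"))
data = pd.Series(range(len(fechas)))
df = pd.DataFrame({"Date": fechas, "Data": data})

# Daily totals, then split the date into day-month and year parts
daily = df.groupby(df["Date"].dt.date)["Data"].sum().reset_index()
daily["Date"] = pd.to_datetime(daily["Date"])
daily["dd-mm"] = daily["Date"].dt.strftime("%d-%m")
daily["year"] = daily["Date"].dt.year

# One row per dd-mm, one column per year
res = daily.pivot(index="dd-mm", columns="year", values="Data").reset_index()
```

Days that exist only in some years (29-02) simply come out as NaN in the other year columns.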
With this code:
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    'df{}'.format(str(i)) = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
I get this error:
'df{}'.format(str(i)) = pd.read_excel('test.xlsx',sheet_name=snlist,
skiprows=range(6))
^ SyntaxError: cannot assign to function call
I can't understand the error or how to solve it. What's the problem?
df + str(i) also returns an error.
I want the result to be like:
df1 = pd.read_excel.. list1...
df2 = pd.read_excel... list2....
You can't assign the result of pd.read_excel to 'df{}'.format(str(i)) -- which is a string that looks like "df1", "df2", etc. That is why you get this error message: Python treats the left-hand side as a function call, and you cannot assign to a function call.
It seems like you want a list or a dictionary of DataFrames instead.
To do this, assign the result of pd.read_excel to a variable, e.g. df, and then append it to a list or add it to a dictionary of DataFrames.
As a list:
dataframes = []
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes.append(df)
As a dictionary:
dataframes = {}
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes[i] = df
For the list, you can access the DataFrames by position:
for i in range(len(dataframes)):
    print(dataframes[i])
    # Note list indexes start at 0 here instead of 1
    # (the dictionary above is keyed 1 through 12 instead)
Or more simply:
for df in dataframes:
    print(df)
In the case of the dictionary, you'd probably want:
for i, df in dataframes.items():
    print(i, df)
    # Here, `i` is the key and `df` is the actual DataFrame
If you really do want df1, df2 etc as the keys, then do this instead:
dataframes[f'df{i}'] = df
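A side note on the pandas API that can replace the loop entirely: pd.read_excel accepts sheet_name=None, in which case it returns a dict mapping each sheet name to its DataFrame in one call. A minimal self-contained sketch (the workbook written here is made up for illustration; reading and writing .xlsx assumes an engine such as openpyxl is installed):

```python
import pandas as pd

# Build a small two-sheet workbook so the example is self-contained
with pd.ExcelWriter("test.xlsx") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"a": [3, 4]}).to_excel(writer, sheet_name="Sheet2", index=False)

# sheet_name=None loads every sheet at once, keyed by sheet name
dataframes = pd.read_excel("test.xlsx", sheet_name=None)
```

You can still pass skiprows and other read_excel arguments; they are applied to every sheet.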
I have created a dataframe as shown:
idx = pd.MultiIndex.from_product([['batch1', 'batch2','batch3', 'batch4', 'batch5'], ['quiz1', 'quiz2']])
cols=['noofpresent', 'lesserthan50', 'between50and60', 'between60and70', 'between70and80', 'greaterthan80']
statdf = pd.DataFrame('-', idx, cols)
statdf
statdf.loc['quiz1', 'noofpresent'] = qdf1.b4ispresent.count()
statdf.loc['quiz2', 'noofpresent'] = qdf2.b4ispresent.count()
statdf.noopresent = qdf1.b4ispresent.count()
statdf.noopresent = qdf2.b4ispresent.count()
statdf
Then I made some calculations. I now want to insert the calculated figures 50 and 53 into the 'noofpresent' column for ('batch4', 'quiz1') and ('batch4', 'quiz2') respectively. But instead this happened...
How can I insert my data into the right place?
You can index it with a single .loc call, passing a tuple as the MultiIndex row key. (Chained indexing like statdf.loc['batch4','quiz1']['noofpresent'] = ... assigns to a temporary copy and may not write back to statdf.)
statdf.loc[('batch4', 'quiz1'), 'noofpresent'] = qdf1.b4ispresent.count()
statdf.loc[('batch4', 'quiz2'), 'noofpresent'] = qdf2.b4ispresent.count()
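To make the difference concrete, here is a runnable sketch using the question's frame, with the figures 50 and 53 standing in for the qdf1/qdf2 counts (which are not shown in the question):

```python
import pandas as pd

# The question's MultiIndex frame, filled with '-' placeholders
idx = pd.MultiIndex.from_product(
    [["batch1", "batch2", "batch3", "batch4", "batch5"], ["quiz1", "quiz2"]]
)
cols = ["noofpresent", "lesserthan50", "between50and60",
        "between60and70", "between70and80", "greaterthan80"]
statdf = pd.DataFrame("-", idx, cols)

# A single .loc call with a (row-tuple, column) key writes into statdf itself
statdf.loc[("batch4", "quiz1"), "noofpresent"] = 50  # stand-in for qdf1.b4ispresent.count()
statdf.loc[("batch4", "quiz2"), "noofpresent"] = 53  # stand-in for qdf2.b4ispresent.count()
```

The rest of the frame is untouched; only the two targeted cells change.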
I'm trying to write a for loop where I subset a dataframe for each unique ID and create a new column. In my example, I want to compute a new balance based on the ID, balance, and initial amount. My idea was to loop through each group of IDs, take that subset, and follow it with some if/elif statements. In each iteration I want the loop to look at one unique ID; for example, when I loop through df["ID"] == 2 there should be 7 rows, since their balances are all related to each other. This is what my dataframe looks like:
df = pd.DataFrame({
    "ID": [2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4],
    "Initial amount": [3250, 10800, 6750, 12060, 8040, 4810, 12200, 13000, 10700, 12000, 27000],
    "Balance": [0, 0, 0, 0, 0, 0, 0, 2617, 19250, 19250, 19250],
    "expected output": [0, 0, 0, 0, 0, 0, 0, 2617, 10720, 8530, 0],
})
My current code looks like this, but I feel like I'm heading in the wrong direction. Thanks!
unique_ids = list(df["ID"].unique())
new_output = []
for i in range(len(unique_ids)):
    this_id = unique_ids[i]
    subset = df.loc[df["ID"] == this_id, :]
    for j in range(len(subset)):
        this_bal = subset["Balance"]
        this_amt = subset["Initial amount"]
        if j == 0:
            this_output = np.where(this_bal >= this_amt, this_amt, this_bal)
            new_output.append(this_output)
        elif this_bal - sum(this_output) >= this_amt:
            this_output = this_amt
            new_output.append(this_output)
        else:
            this_output = this_bal - sum(this_output)
            new_output.append(this_output)
Any suggestions would be greatly appreciated!
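One possible reading of the expected output is a greedy allocation: each ID's balance is spent across its rows in order, each row taking at most its initial amount until the balance runs out. This is only a sketch of that interpretation (it reproduces the sample exactly for IDs 2 and 3; for ID 4 it gives 10700/8550 where the sample shows 10720/8530, so treat the exact rule as an assumption). The column name "new_balance" is made up here:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4],
    "Initial amount": [3250, 10800, 6750, 12060, 8040, 4810, 12200,
                       13000, 10700, 12000, 27000],
    "Balance": [0, 0, 0, 0, 0, 0, 0, 2617, 19250, 19250, 19250],
})

def allocate(group):
    # Spend the group's shared balance greedily: each row takes up to
    # its initial amount, and the remainder carries to the next row.
    remaining = group["Balance"].iloc[0]
    out = []
    for amt in group["Initial amount"]:
        take = min(remaining, amt)
        out.append(take)
        remaining -= take
    return pd.Series(out, index=group.index)

df["new_balance"] = df.groupby("ID", group_keys=False).apply(allocate)
```

Using groupby with a per-group function avoids the manual subset/index bookkeeping of the nested loops.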
I have a pandas dataframe with two columns: one with a single date ('action_date') and one with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the corresponding 'verification_date' list, and then fill two new columns with the number of dates in verification_date whose difference is either over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row individually. Note that DataFrame.set_value and the .ix indexer have since been removed from pandas; the modern way to write a single scalar by label is .at:
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)
This sets the value at row i in column over_360 or under_360. Label-based .loc assignment (df.loc[i, 'over_360'] = ...) works the same way.
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do when the difference is exactly 360, so just change > or < into >= or <= as needed.
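For completeness, here is a self-contained sketch of this apply-based approach using the question's data. Note that pd.TimeGrouper from the question has since been removed from pandas, so the sketch uses pd.Grouper:

```python
import pandas as pd

# Rebuild the question's frame
df = pd.DataFrame({
    "action_date": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-03"]),
    "verification_date": pd.to_datetime(["2016-01-01", "2015-01-08", "2017-01-01"]),
    "user_name": ["abc", "wdt", "sdf"],
})
df.index = df.action_date
# pd.TimeGrouper was removed from pandas; pd.Grouper is the replacement
df = df.groupby(pd.Grouper(freq="2D"))["verification_date"].apply(list).reset_index()

# Count, per row, the list entries more / less than 360 days before action_date
df["over_360"] = df.apply(
    lambda x: sum((x["action_date"] - i).days > 360 for i in x["verification_date"]),
    axis=1,
)
df["under_360"] = df.apply(
    lambda x: sum((x["action_date"] - i).days < 360 for i in x["verification_date"]),
    axis=1,
)
```

The first row now gets over_360 = 2 (both verification dates are more than 360 days before 2017-01-01), matching what the asker expected.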