I have a pandas df with two columns year and week_number.
df = pd.DataFrame({'year': [2019, 2020, 2021, 2022], 'week_number':[3,12,38,42]})
df
year week_number
0 2019 3
1 2020 12
2 2021 38
3 2022 42
I know I can apply something like the following to each row to convert it to a datetime value. However, I want to know whether there is a more efficient way to do this for big dataframes and store the results in a third column.
import datetime
single_day = "2013-26"
converted_date = datetime.datetime.strptime(single_day + '-1', "%Y-%W-%w")
print(converted_date)
I wouldn't say your way is inefficient, but if you want a fully vectorized way that doesn't need another import and appends a column to your dataframe, this might be what you're looking for:
import pandas as pd
df = pd.DataFrame({'year': [2019, 2020, 2021, 2022], 'week_number':[3,12,38,42]})
df['date'] = pd.to_datetime((df['year']*100+df['week_number']).astype(str) + '0', format='%Y%W%w')  # %W = week of year (Monday-based), %w = weekday (0 = Sunday)
df
If you are on Python >= 3.8, use datetime.date.fromisocalendar. The same method also exists on datetime.datetime.
# 11 May 2022 is a Wednesday in the 19th ISO week
>>> from datetime import date
>>> date.fromisocalendar(2022, 19, 3)
datetime.date(2022, 5, 11)
As new Column:
df['date'] = df[['year', 'week_number']].apply(lambda row: date.fromisocalendar(row['year'], row['week_number'], 1), axis=1)
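If you want the ISO-calendar version but vectorized, pd.to_datetime also understands the ISO directives %G (ISO year), %V (ISO week) and %u (ISO weekday) in recent pandas versions, so a sketch along these lines should work (older versions may not accept the ISO directives):
df['date'] = pd.to_datetime(
    df['year'].astype(str) + '-' + df['week_number'].astype(str).str.zfill(2) + '-1',
    format='%G-%V-%u'  # ISO year, ISO week, ISO weekday (1 = Monday)
)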
Use apply to loop over rows (axis=1) with a lambda that concatenates the two columns into a string and then does exactly what you did above :) Perhaps this isn't the answer you were looking for, since you asked for the most efficient solution, but it does the job!
from datetime import datetime
df['convert_date'] = df.apply(lambda x: datetime.strptime(f"{x.year}-{x.week_number}" + '-1', "%Y-%W-%w"), axis=1)
I'm trying to create a forecast which takes the previous day's 'Forecast' total and adds it to the current day's 'Appt'. Something which is straightforward in Excel but I'm struggling in pandas. At the moment all I can get in pandas using .loc is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02','2022-12-03','2022-12-04','2022-12-05'],
'Appt': [12,10,5,4,13],
'Forecast': [37,0,0,0,0]
})
What I'm looking for it to do is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02','2022-12-03','2022-12-04','2022-12-05'],
'Appt': [12,10,5,4,13],
'Forecast': [37,47,52,56,69]
})
E.g. the 'Forecast' total on 1st December is 37. On 2nd December the value in the 'Appt' column is 10. I want it to take 37 + 10 and put the result in the 'Forecast' column for 2nd December, then iterate over the rest of the column.
I've tried using .loc() with the index and experimented with .shift(), but neither seems to work for what I'd like. I also looked into .rolling(), but I think that's not appropriate either.
I'm sure there must be a simple way to do this?
Apologies, the original df has 'Date' as a datetime column.
You can use mask and cumsum: replace the zeros in 'Forecast' with the corresponding 'Appt' values, then take a cumulative sum:
df['Forecast'] = df['Forecast'].mask(df['Forecast'].eq(0), df['Appt']).cumsum()
# or
df['Forecast'] = np.where(df['Forecast'].eq(0), df['Appt'], df['Forecast']).cumsum()  # requires import numpy as np
Output:
Date Appt Forecast
0 2022-12-01 12 37
1 2022-12-02 10 47
2 2022-12-03 5 52
3 2022-12-04 4 56
4 2022-12-05 13 69
You have to make sure that your column has a datetime/date type; then you can filter the df like this:
# previous code & imports
from datetime import datetime, timedelta
yesterday = datetime.now().date() - timedelta(days=1)
df[df["date"] == yesterday]["your_column"].sum()
Trying to extract year from dataset in python
df["YYYY"] = pd.DatetimeIndex(df["Date"]).year
The year appears with a decimal point in the new column:
YYYY
2001.0
2002.0
2015.0
2022.0
How can I make the year appear with no decimal point?
You likely have null values in your input, resulting in NaNs and a float type for your column.
No missing values:
pd.DatetimeIndex(['2022-01-01']).year
Int64Index([2022], dtype='int64')
Missing values:
pd.DatetimeIndex(['2022-01-01', '']).year
Float64Index([2022.0, nan], dtype='float64')
I suggest using pandas.to_datetime combined with convert_dtypes:
pd.to_datetime(pd.Series(['2022-01-01', ''])).dt.year.convert_dtypes()
0 2022
1 <NA>
dtype: Int64
Or extract the year directly from the initial strings, but for that we would need a sample of the input.
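Alternatively, if you want to keep the original pd.DatetimeIndex(...).year approach, casting the result to pandas' nullable integer dtype also removes the decimal point (a sketch, assuming the column is called "Date"):
df["YYYY"] = pd.to_datetime(df["Date"]).dt.year.astype("Int64")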
A sample program for your problem:
import pandas as pd
df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000']})
df['date'] = pd.to_datetime(df['date'])
df['year'] = pd.DatetimeIndex(df['date']).year
print(df['year'])
pandas often takes care of dates by itself; if not, we can convert the column explicitly:
df["date_field"] = pd.to_datetime(df["date_field"])
I hope this makes things clear to you.
If not, can you share a sample of the df?
Right now, I am parsing my file by using skiprows, but skiprows is unreliable because the data can change. I want to skip rows based on keywords such as "Ferrari, Apple, Baseball". How can I accomplish this? Could you please provide examples?
EDIT: If possible, another solution that could work better for me is to skip n rows in the beginning and then stop reading values in the columns after a BLANK entry is reached. Is this possible?
import pandas as pd
import pyodbc
df = pd.read_csv(r'C://mycsvfile.csv', skiprows=[3,108,109,110,111,112,114,115,116,118])
"""
Step 2 Specify columns we want to import
"""
columns = ['Run Date','Action','Symbol','Security Description','Security Type','Quantity','Price ($)','Commission ($)','Fees ($)','Accrued Interest ($)','Amount ($)','Settlement Date']
df_data = df[columns]
records = df_data.values.tolist()
print(df)
You can try to parse every column, look for the keyword you need, and delete the rows that contain it.
df = df[df["Run Date"].str.contains("Ferrari") == False]
Make it a loop if you have several keywords or columns to check; see the sketch below.
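For example, a rough sketch of that loop over several keywords (case-insensitive, with na=False so empty cells don't break the comparison):
for keyword in ["Ferrari", "Apple", "Baseball"]:
    df = df[~df["Run Date"].str.contains(keyword, case=False, na=False)]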
There are a few ways to do it. Here's my solution:
1. Make all keywords lower case to eliminate case sensitivity
2. Define which columns need to be checked for the keywords (this could be altered to check all columns if needed)
3. Concatenate the columns so all of them are checked at once instead of iterating through each one
4. Make the cells all lower case (see 1)
5. Keep the rows that do not contain a keyword
Code:
import pandas as pd
df = pd.DataFrame([['I love apples.', '', 1, 'Jan 1, 2021'],
['Apple is tasty.', 'Ferrari', 2, 'Jan 2, 2022'],
['This does not contain a keyword', 'Nor does this.', 15, 'Mar 1, 2021'],
['This row is ok', 'But it has baseball in it.', 34, 'Feb 1, 2021']], columns = ['A','B','Value','Date'])
keywords = ['Ferrari', 'Apple', 'Baseball']
keywords = '|'.join(keywords)
keywords = keywords.lower()
columns_to_check = ['A','B', 'Value']
df = df[~df[columns_to_check].astype(str).sum(axis=1).str.lower().str.contains(keywords)]
Input:
print(df.to_string())
A B Value Date
0 I love apples. 1 Jan 1, 2021
1 Apple is tasty. Ferrari 2 Jan 2, 2022
2 This does not contain a keyword Nor does this. 15 Mar 1, 2021
3 This row is ok But it has baseball in it. 34 Feb 1, 2021
Output:
print(df.to_string())
A B Value Date
2 This does not contain a keyword Nor does this. 15 Mar 1, 2021
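For the EDIT (skip a few rows at the start, then stop reading at the first blank entry), one possible sketch is to read everything and truncate at the first all-empty row. This assumes a blank CSV line comes through as an all-NaN row and that the frame has a default RangeIndex; skiprows=3 is just a placeholder for however many fixed header rows you have.
import pandas as pd

df = pd.read_csv('mycsvfile.csv', skiprows=3)  # skip only the fixed header rows
blank = df.isna().all(axis=1)                  # True where the whole row is empty
if blank.any():
    df = df.iloc[:blank.to_numpy().argmax()]   # keep everything before the first blank row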
I just came across a strange phenomenon with pandas DataFrames: when setting a new index using DataFrame.set_index('some_index'), the column that previously served as the index is deleted! Here is an example:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
>>> df_mn
sale year
month
1 55 2012
4 40 2014
7 84 2013
10 31 2014
Now I change the index to year:
df_mn.set_index('year')
sale
year
2012 55
2014 40
2013 84
2014 31
.. and the month column was removed along with the index. This is very irritating, because I just wanted to swap the DataFrame index.
Is there a way to not have the previous column that was an index from being deleted? Maybe through something like: DataFrame.set_index('new_index',delete_previous_index=False)
Thanks for any advice
You can do the following
>>> df_mn.reset_index().set_index('year')
month sale
year
2012 1 55
2014 4 40
2013 7 84
2014 10 31
The solution I found to retain a previous column is to set drop=False:
dataframe.set_index('some_column', drop=False). This is not the perfect answer but it works!
No, in such cases you have to save your previous column, like shown
below:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
df_mn['month'] = df_mn.index #Save it as another column, and then run set_index with year column as value.
df_mn.set_index('year')
Besides, you are working on a separate dataframe df_mn, so the original dataframe df remains unchanged and you can use it again. Also, since you aren't setting the inplace argument of set_index to True, df_mn itself won't change even after you call set_index() on it.
Also, like the other answer you can always use reset_index().
I am trying to average each cell of a bunch of .csv files to export as a single averaged .csv file using Pandas.
I have no problems creating the dataframe itself, but when I try to turn it into a Panel (i.e. panel = pd.Panel(dataFrame)), I get the error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
An example of what each csv file looks like:
Year, Month, Day, Latitude, Longitude, Value1, Value 2
2010, 06, 01, 23, 97, 1, 3.5
2010, 06, 01, 24, 97, 5, 8.2
2010, 06, 01, 25, 97, 6, 4.6
2010, 06, 01, 26, 97, 4, 2.0
Each .csv file is from gridded data so they have the same number of rows and columns, as well as some no data values (given a value of -999.9), which my code snippet below addresses.
The code that I have so far to do this is:
june = []
for csv1 in glob.glob(path+'\\'+'*.csv'):
    if csv1[-10:-8] == '06':
        june.append(csv1)
dfs = {i: pd.DataFrame.from_csv(i) for i in june}
panel = pd.Panel(dfs)
panels = panel.replace(-999.9, np.NaN)
dfs_mean = panels.mean(axis=0)
I have seen questions where the user is getting the same error, but the solutions for those questions doesn't seem to work with my issue. Any help fixing this, or ideas for a better approach would be greatly appreciated.
pd.Panel has been deprecated (and removed in pandas 1.0).
Use pd.concat with a dictionary comprehension and take the mean over level 1.
from glob import glob
df1 = pd.concat({f: pd.read_csv(f) for f in glob('meansample[0-9].csv')})
df1.mean(level=1)
Year Month Day Latitude Longitude Value1 Value 2
0 2010 6 1 23 97 1 3.5
1 2010 6 1 24 97 5 8.2
2 2010 6 1 25 97 6 4.6
3 2010 6 1 26 97 4 2.0
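Note that in newer pandas versions the level argument of mean has been deprecated and removed, so the equivalent there is a groupby on that index level:
df1.groupby(level=1).mean()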
I have a suggestion to change the approach a bit. Instead of converting each DataFrame into a panel, just concat them into one big DataFrame, giving each one a unique ID. Afterwards you can groupby that ID and use mean() to get the result.
It would look similar to this:
import glob
import numpy as np
import pandas as pd

df = pd.DataFrame()
for csv1 in glob.glob(path+'\\'+'*.csv'):
    if csv1[-10:-8] == '06':
        temp_df = pd.read_csv(csv1)
        temp_df['df_id'] = csv1
        df = pd.concat([df, temp_df])
df = df.replace(-999.9, np.nan)  # assign back; replace is not in-place
df = df.groupby("df_id").mean()
I hope it helps somehow; if you still have any issues with that, let me know.