Python pandas dataframe column and index formatting issues

I am trying to output a pandas dataframe and I am only getting one column (PADD 5) instead of PADD 1 through PADD 5. In addition, I cannot get the index to format as YYYY-MM-DD. I would appreciate it if anyone knew how to fix these two things. Thanks much!
import os
import requests
import pandas as pd

# API Key from EIA
api_key = 'xxxxxxxxxxx'
# api_key = os.getenv("EIA_API_KEY")

# PADD Names to Label Columns
# Change to whatever column labels you want to use.
PADD_NAMES = ['PADD 1','PADD 2','PADD 3','PADD 4','PADD 5']

# Enter all your Series IDs here, separated by commas
PADD_KEY = ['PET.MCRRIP12.M',
            'PET.MCRRIP22.M',
            'PET.MCRRIP32.M',
            'PET.MCRRIP42.M',
            'PET.MCRRIP52.M']

# Initialize the list that will store all the data from the JSON pulls;
# it is later concatenated into a single pandas dataframe.
final_data = []

# Choose start and end dates
startDate = '2009-01-01'
endDate = '2023-01-01'

for i in range(len(PADD_KEY)):
    url = 'https://api.eia.gov/series/?api_key=' + api_key + '&series_id=' + PADD_KEY[i]
    r = requests.get(url)
    json_data = r.json()
    if r.status_code == 200:
        print('Success!')
    else:
        print('Error')
    print(json_data)
    df = pd.DataFrame(json_data.get('series')[0].get('data'),
                      columns=['Date', PADD_NAMES[i]])
    df.set_index('Date', drop=True, inplace=True)
    final_data.append(df)

# Combine all the data into one dataframe
crude = pd.concat(final_data, axis=1)

# Parse the YYYYMM index into a proper datetime index
crude['Year'] = crude.index.astype(str).str[:4]
crude['Month'] = crude.index.astype(str).str[4:]
crude['Day'] = 1
crude['Date'] = pd.to_datetime(crude[['Year','Month','Day']])
crude.set_index('Date', drop=True, inplace=True)
crude.sort_index(inplace=True)
crude = crude[startDate:endDate]
crude = crude.iloc[:, :5]

df.head()
        PADD 5
Date
202201    1996
202112    2071
202111    2125
202110    2128
202109    2232
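Note that the output above comes from df.head(), i.e. the last per-PADD frame left over from the loop, not the combined crude frame. A minimal check of the combined result (assuming the requests above succeeded):
# 'crude' holds all five PADD columns and a real DatetimeIndex
print(crude.head())        # shows PADD 1 through PADD 5
print(crude.index.dtype)   # datetime64[ns], which displays as YYYY-MM-DD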

Related

How to add a new row with new header information in same dataframe

I have written code to retrieve JSON data from a URL. It works fine. I give a start and end date, and it loops through the date range and appends everything to a dataframe.
The columns are populated with the JSON sensor data and their corresponding values, so the column names look like sensor_1. When I request the data from the URL, it sometimes happens that there are new sensors while the old ones are switched off and deliver no data anymore, and often the number of columns changes. In that case my code just adds new columns.
What I want, instead of new columns, is a new header in the ongoing dataframe.
What I currently get with my code:
datetime;sensor_1;sensor_2;sensor_3;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-01;23.2;43.5;45.2;NaN;NaN;NaN;NaN;NaN;
2023-01-02;13.2;33.5;55.2;NaN;NaN;NaN;NaN;NaN;
2023-01-03;26.2;23.5;76.2;NaN;NaN;NaN;NaN;NaN;
2023-01-04;NaN;NaN;NaN;75;12;75;93;123;
2023-01-05;NaN;NaN;NaN;23;31;24;15;136;
2023-01-06;NaN;NaN;NaN;79;12;96;65;72;
What I want:
datetime;sensor_1;sensor_2;sensor_3;
2023-01-01;23.2;43.5;45.2;
2023-01-02;13.2;33.5;55.2;
2023-01-03;26.2;23.5;76.2;
datetime;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-04;75;12;75;93;123;
2023-01-05;23;31;24;15;136;
2023-01-06;79;12;96;65;72;
My loop to retrieve the data:
import datetime
from datetime import timedelta
import json
import requests
import numpy as np
import pandas as pd

start_date = datetime.datetime(2023, 1, 1, 0, 0)
end_date = datetime.datetime(2023, 1, 6, 0, 0)
sensor_data = pd.DataFrame()
while start_date < end_date:
    q = 'url'
    r = requests.get(q)
    j = json.loads(r.text)
    sub_data = pd.DataFrame()
    if 'result' in j:
        timestamps = pd.to_datetime(np.array(j['result']['data'])[:, 0])
        sensors = np.array(j['result']['sensors'])
        data = np.array(j['result']['data'])[:, 1:]
        df_new = pd.DataFrame(data, index=timestamps, columns=sensors)
        sub_data = pd.concat([sub_data, df_new])
    sensor_data = pd.concat([sensor_data, sub_data])
    start_date += timedelta(days=1)
If two DataFrames will do for you, then you can simply split using the column names:
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']]
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']]
Note the double brackets ([[) used to select multiple columns, and use .dropna() to drop the NaN rows.
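Putting the two steps together, a minimal sketch assuming the combined frame from the loop above is named sensor_data (with datetime as its index):
# split by sensor group, then drop the rows that are entirely NaN in each group
df1 = sensor_data[['sensor_1', 'sensor_2', 'sensor_3']].dropna(how='all')
df2 = sensor_data[['new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']].dropna(how='all')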

How to create new csv from list of csv's in dataframe

I know my code isn't close to right, but I am trying to loop through a list of CSVs, line by line, to create a new CSV in which each line lists all the CSVs that met a condition on that date. The first column in all the CSVs is "date"; I want to list the names of all CSVs where data["entry"] > 3 on that date, with date still being the first column.
Update: What I'm trying to do is, for each CSV, build a list of the dates on which the condition was met, and on those dates append file_name to the corresponding row(s) of the new CSV.
### create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/SentdexTutorial/stock_dfs/')
### append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/SentdexTutorial/stock_dfs/'
listdrs_path = [string + x for x in listdrs]
complete_string = ' is complete'
listdrs_confirmation = [x + complete_string for x in listdrs]
#print(listdrs_path)

### start loop: for each file in listdrs run the two functions below and overwrite the saved csv
for file_path in listdrs_path:
    data = pd.read_csv(file_path, index_col=0)

    ########################################
    #### function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        # Convert date to timestamp and make index
        data.index = data["date"].apply(lambda x: pd.Timestamp(x))
        data.drop("date", axis=1, inplace=True)
        return data

    ## create new table and append data
    data = data[data.Entry > 3]
    for date in data.date:
        new_table[date].append(file_path)
    new_table_data = pd.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())],
                                  columns=['date', 'table names'])
    print(new_table_data)
I would do something like this. You need to modify the following snippet according to your needs.
import pandas as pd
from glob import glob
from collections import defaultdict

# create and save some random data
df1 = pd.DataFrame({'date': [1, 2, 3], 'entry': [4, 3, 2]})
df2 = pd.DataFrame({'date': [1, 2, 3], 'entry': [1, 2, 4]})
df3 = pd.DataFrame({'date': [1, 2, 3], 'entry': [3, 1, 5]})
df1.to_csv('table1.csv')
df2.to_csv('table2.csv')
df3.to_csv('table3.csv')

# read all the csv
tables = glob('*.csv')
new_table = defaultdict(list)

# create new table
for table in tables:
    df = pd.read_csv(table)
    df = df[df.entry > 2]
    for date in df.date:
        new_table[date].append(table)

new_table_df = pd.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())],
                            columns=['date', 'table names'])
print(new_table_df)
   date             table names
0     1   table3.csv,table1.csv
1     2              table1.csv
2     3   table2.csv,table3.csv
I had some issues with the other code; here is the final solution I was able to come up with.
# new_table is initialized as an empty dict (and file_name is set per file) earlier in the loop
if 'Entry' in data:
    ## create new table and append data
    data = data[data.Entry > 3]
    if 'date' in data:
        for date in data.date:
            if date not in new_table:
                new_table[date] = []
            new_table[date].append(
                pd.DataFrame({'FileName': [file_name],
                              'Entry': [int(data[data.date == date].Entry)]}))
    elif 'Date' in data:
        for date in data.Date:
            if date not in new_table:
                new_table[date] = []
            new_table[date].append(
                pd.DataFrame({'FileName': [file_name],
                              'Entry': [int(data[data.Date == date].Entry)]}))

def find_max(tbl):
    new_table_data = {}
    for date in sorted(tbl.keys()):
        merged_dt = pd.concat(tbl[date])
        max_entry_v = max(list(merged_dt.Entry))
        tbl_names = list(merged_dt[merged_dt.Entry == max_entry_v].FileName)
        new_table_data[date] = tbl_names
    return new_table_data

new_table_data = find_max(tbl=new_table)
print(new_table_data)
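To write the result out as a CSV (what the question originally asked for), a sketch assuming new_table_data maps each date to a list of file names; 'matches.csv' is a placeholder name:
out = pd.DataFrame([(date, ','.join(names)) for date, names in sorted(new_table_data.items())],
                   columns=['date', 'table names'])
out.to_csv('matches.csv', index=False, header=True)  # placeholder output path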

How to stop apply() from changing order of columns?

I have a reproducible example, toy dataframe:
df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'other_column': ['yes', 'no']})
print(df)

  my_customers                email other_column
0         John      email@gmail.com          yes
1          Foo  othermail@yahoo.com           no
And I apply() a function to the rows, creating a new column inside the function:
def func(row):
    # if this column is 'yes'
    if row['other_column'] == 'yes':
        # create a new column with 'Hello' in it
        row['new_column'] = 'Hello'
        # return to df
        return row
    # otherwise
    else:
        # just return the row
        return row
I then apply the function to the df, and we can see that the order has been changed. The columns are now in alphabetical order. Is there any way to avoid this? I would like to keep it in the original order.
df = df.apply(func, axis=1)
print(df)

                 email my_customers new_column other_column
0      email@gmail.com         John      Hello          yes
1  othermail@yahoo.com          Foo        NaN           no
Edited for clarification - the above code was too simple.
Input:
df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'api_status': ['data found', 'no data found'],
                   'api_response': ['huge json', 'huge json']})

  my_customers                email     api_status api_response
0         John      email@gmail.com     data found    huge json
1          Foo  othermail@yahoo.com  no data found    huge json
Parsing the api_response, I need to create many new columns in the DF:
def api_parse(row):
    # if we have response data
    if row['api_status'] == 'data found':
        # get response for parsing
        response_data = row['api_response']

        """Let's get associated URLs first"""
        # if there's a URL section in the response
        if 'urls' in response_data.keys():
            # get all associated URLs into a list
            # (extract_values is the asker's helper for pulling nested JSON values)
            urls = extract_values(response_data['urls'], 'url')
            row['Associated_Urls'] = urls

        """Get a list of jobs"""
        if 'jobs' in response_data.keys():
            # get all associated jobs and organizations into lists
            titles = extract_values(response_data['jobs'], 'title')
            organizations = extract_values(response_data['jobs'], 'organization')
            counter = 1
            # create a new column for each job
            for pair in zip(titles, organizations):
                row['Job' + '_' + str(counter)] = f'Title: {pair[0]}, Organization: {pair[1]}'
                counter += 1

        """Get a list of education"""
        if 'educations' in response_data.keys():
            # get all degrees into a list
            degrees = extract_values(response_data['educations'], 'display')
            counter = 1
            # create a new column for each degree
            for edu in degrees:
                row['education' + '_' + str(counter)] = edu
                counter += 1

        """Get a list of social profiles from the URLs we parsed earlier"""
        facebook = [i for i in urls if 'facebook' in i] or [np.nan]
        instagram = [i for i in urls if 'instagram' in i] or [np.nan]
        linkedin = [i for i in urls if 'linkedin' in i] or [np.nan]
        twitter = [i for i in urls if 'twitter' in i] or [np.nan]
        amazon = [i for i in urls if 'amazon' in i] or [np.nan]
        row['facebook'] = facebook
        row['instagram'] = instagram
        row['linkedin'] = linkedin
        row['twitter'] = twitter
        row['amazon'] = amazon
        return row
    elif row['api_status'] == 'no data found':
        # do nothing
        return row
expected output:
  my_customers                email     api_status api_response job_1 job_2 \
0         John      email@gmail.com     data found    huge json   xyz  xyz2
1          Foo  othermail@yahoo.com  no data found    huge json   nan   nan

  education_1 facebook other api info
0         foo profile1            etc
1         nan      nan            nan
You could adjust the order of columns in your DataFrame after running the apply function. For example:
df = df.apply(func, axis = 1)
df = df[['my_customers', 'email', 'other_column', 'new_column']]
To reduce the amount of duplication (i.e. by having to retype all column names), you could get the existing set of columns before calling the apply function:
columns = list(df.columns)
df = df.apply(func, axis = 1)
df = df[columns + ['new_column']]
Update based on the author's edits to the original question. Whilst I'm not sure if the data structure chosen (storing API results in a DataFrame) is the best option, one simple solution could be to extract the new columns after calling the apply function.
# Store the existing columns before calling apply
existing_columns = list(df.columns)
df = df.apply(func, axis=1)
all_columns = list(df.columns)
new_columns = [column for column in all_columns if column not in existing_columns]
df = df[existing_columns + new_columns]
For a performance optimisation, you could store the existing columns in a set instead of a list, which yields constant-time lookups due to the hashed nature of the set data structure in Python. This would change existing_columns = list(df.columns) to existing_columns = set(df.columns).
Finally, as @Parfait very kindly points out in their comment, the code above may raise some deprecation warnings. Using pandas.DataFrame.reindex instead of df = df[existing_columns + new_columns] will make the warnings disappear:
new_columns_order = existing_columns + new_columns
df = df.reindex(columns=new_columns_order)
That occurs because you don't assign a value to the new column if row["other_column"] != 'yes'. Just try this:
def func(row):
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
    else:
        row['new_column'] = ''
    return row

df = df.apply(func, axis=1)
You can set the value of row['new_column'] for the 'no' case to whatever you like; I just left it blank.
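For a simple yes/no rule like this one, a vectorized assignment avoids apply() altogether and never disturbs the column order; a minimal sketch with numpy.where:
import numpy as np

# existing columns keep their order; the new column is appended at the end
df['new_column'] = np.where(df['other_column'] == 'yes', 'Hello', '')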

How to fill missing dates in a time series

Here's what my data looks like:
There are daily records, except for a gap from 2017-06-12 to 2017-06-16.
df2['timestamp'] = pd.to_datetime(df['timestamp'])
df2['timestamp'] = df2['timestamp'].map(lambda x: datetime.datetime.strftime(x, '%Y-%m-%d'))
df2 = df2.convert_objects(convert_numeric=True)
df2 = df2.groupby('timestamp', as_index=False).sum()
I need to fill this missing gap and others with values for all fields (e.g. timestamp, temperature, humidity, light, pressure, speed, battery_voltage, etc...).
How can I accomplish this with Pandas?
This is what I have done before
weektime = pd.date_range(start='06/04/2017', end='12/05/2017', freq='W-SUN')
df['week'] = 'nan'
df['weektemp'] = 'nan'
df['weekhumidity'] = 'nan'
df['weeklight'] = 'nan'
df['weekpressure'] = 'nan'
df['weekspeed'] = 'nan'
df['weekbattery_voltage'] = 'nan'
for i in range(0, len(weektime)):
    df['week'][i+1] = weektime[i]
    df['weektemp'][i+1] = df['temperature'].iloc[7*i+1:7*i+7].sum()
    df['weekhumidity'][i+1] = df['humidity'].iloc[7*i+1:7*i+7].sum()
    df['weeklight'][i+1] = df['light'].iloc[7*i+1:7*i+7].sum()
    df['weekpressure'][i+1] = df['pressure'].iloc[7*i+1:7*i+7].sum()
    df['weekspeed'][i+1] = df['speed'].iloc[7*i+1:7*i+7].sum()
    df['weekbattery_voltage'][i+1] = df['battery_voltage'].iloc[7*i+1:7*i+7].sum()
The sums are not correct, because the value for 2017-06-17 is already a sum of 2017-06-12 through 2017-06-16, and I do not want to add those days again. This is also not the only gap in the period; I want to fill all of them.
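For reference, the gaps themselves can also be filled with a plain reindex against a complete daily range; a minimal sketch, assuming df2 still has its raw timestamp column:
# index by the parsed timestamps, then reindex against a complete daily range;
# missing days show up as NaN rows, ready for interpolation or a fill
df2 = df2.set_index(pd.to_datetime(df2['timestamp'])).drop(columns='timestamp')
full_range = pd.date_range(df2.index.min(), df2.index.max(), freq='D')
df2 = df2.reindex(full_range)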
Here is a function I wrote that might be helpful to you. It looks for inconsistent jumps in time and fills them in. After using this function, try using a linear interpolation function (pandas has a good one) to fill in your null data values. Note: Numpy arrays are much faster to iterate over and manipulate than Pandas dataframes, which is why I switch between the two.
import numpy as np
import pandas as pd

data_arr = np.array(your_df)
periodicity = 'daily'

def fill_gaps(data_arr, periodicity):
    rows = data_arr.shape[0]
    data_no_gaps = np.copy(data_arr)  # avoid altering the thing you're iterating over
    data_no_gaps_idx = 0
    for row_idx in np.arange(1, rows):  # iterate once for each row (except the first record; nothing to compare)
        oldtimestamp_str = str(data_arr[row_idx-1, 0])
        oldtimestamp = np.datetime64(oldtimestamp_str)
        currenttimestamp_str = str(data_arr[row_idx, 0])
        currenttimestamp = np.datetime64(currenttimestamp_str)
        period = currenttimestamp - oldtimestamp
        if period != np.timedelta64(900, 's') and period != np.timedelta64(3600, 's') and period != np.timedelta64(86400, 's'):
            if periodicity == 'quarterly':
                desired_period = 900
            elif periodicity == 'hourly':
                desired_period = 3600
            elif periodicity == 'daily':
                desired_period = 86400
            periods_missing = int(period / np.timedelta64(desired_period, 's'))
            for missing in np.arange(1, periods_missing):
                new_time_orig = str(oldtimestamp + missing*(np.timedelta64(desired_period, 's')))
                new_time = new_time_orig.replace('T', ' ')
                data_no_gaps = np.insert(data_no_gaps, (data_no_gaps_idx + missing),
                                         np.array((new_time, np.nan, np.nan, np.nan, np.nan, np.nan)), 0)  # INSERT VALUES YOU WANT IN THE NEW ROW
            data_no_gaps_idx += (periods_missing-1)  # increment the index (zero-based => -1) in accordance with added rows
        data_no_gaps_idx += 1  # allow index to change as we iterate over original data array (main for loop)
    # create a dataframe:
    data_arr_no_gaps = pd.DataFrame(data=data_no_gaps, index=None,
                                    columns=['Time', 'temp', 'humidity', 'light', 'pressure', 'speed'])
    return data_arr_no_gaps
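A hypothetical call, following the suggestion above to interpolate the inserted NaN rows afterwards (your_df and the column names are as assumed by the function):
df_no_gaps = fill_gaps(np.array(your_df), periodicity='daily')
value_cols = ['temp', 'humidity', 'light', 'pressure', 'speed']
# values come back as objects after the numpy round-trip, so cast before interpolating
df_no_gaps[value_cols] = df_no_gaps[value_cols].astype(float).interpolate(method='linear')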
Fill time gaps and nulls
Use the function below to ensure the expected date sequence exists, then use forward fill to fill in nulls.
import pandas as pd
import os

def fill_gaps_and_nulls(df, freq='1D'):
    '''
    General steps:
    A) check for extra dates (out of expected frequency/sequence)
    B) check for missing dates (based on expected frequency/sequence)
    C) use forward fill to fill nulls
    D) use backward fill to fill remaining nulls
    E) append to file
    '''
    # rename the timestamp to 'date'
    df = df.rename(columns={"timestamp": "date"})
    # sort to make indexing faster
    df = df.sort_values(by=['date'], inplace=False)
    # create an artificial index of dates at frequency = freq, with the same beginning and ending as the original data
    all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq=freq)
    # record column names
    df_cols = df.columns
    # delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass
    # check for extra dates and/or dates out of order; print a warning statement for the log
    extra_dates = set(df.date).difference(all_dates)
    # if there are extra dates (outside of expected sequence/frequency), deal with them
    if len(extra_dates) > 0:
        #############################
        # INSERT DESIRED BEHAVIOR HERE
        print('WARNING: Extra date(s):\n\t{}\n\t Shifting highlighted date(s) back by 1 day'.format(extra_dates))
        for date in extra_dates:
            # shift extra dates back one day
            df.date[df.date == date] = date - pd.Timedelta(days=1)
        #############################
    # check the artificial date index against df to identify missing gaps in time and fill them with nulls
    gaps = all_dates.difference(set(df.date))
    print('\n-------\nWARNING: Missing dates: {}\n-------\n'.format(gaps))
    # if there are time gaps, deal with them
    if len(gaps) > 0:
        # initialize df of correct size, filled with nulls
        gaps_df = pd.DataFrame(index=gaps, columns=df_cols.drop('date'))  # len(index) sets number of rows
        # give the index a name
        gaps_df.index.name = 'date'
        # add the region and type (r and t are assumed to be defined in the surrounding scope)
        gaps_df.region = r
        gaps_df.type = t
        # remove that index so gaps_df and df are compatible
        gaps_df.reset_index(inplace=True)
        # append gaps_df to df
        new_df = pd.concat([df, gaps_df])
        # sort on date
        new_df.sort_values(by='date', inplace=True)
        # fill nulls
        new_df.fillna(method='ffill', inplace=True)
        new_df.fillna(method='bfill', inplace=True)
        # append to file
        new_df.to_csv('ffill_df.csv', mode='a', header=False, index=False)
    # regions and types are likewise assumed to come from the surrounding scope
    return df_cols, regions, types, all_dates
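A hypothetical call, assuming sensor_df holds the raw daily records with a timestamp column, and that r, t, regions, and types are defined in the surrounding script as in the original:
df_cols, regions, types, all_dates = fill_gaps_and_nulls(sensor_df, freq='1D')
# the gap-filled data itself is appended to ffill_df.csv by the function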

How do I filter columns of multiple DataFrames stored in a dictionary in an efficient way?

I am working with stock data and I want to make my data sets have equal length of data when performing certain types of analysis.
Problem
If I load data for Apple, I get daily data going back to 1985, but if I load data for a natural gas ETF it might only go back to 2012. I now want to filter Apple to only show history going back to 2012. The same goes for the end date: some of my datasets may not be up to date, e.g. the Apple data ranges from 1985 to 1-20-17 while the natural gas ETF data ranges from 2012 to 12-23-16, so I also want another filter that sets the max date. My Apple dataset is then filtered to dates ranging from 2012 to 12-23-16, and my datasets are equal.
Approach
I have a dictionary called Stocks which stores all of my dataframes. All the dataframes have a column named D, which is the Date column.
I wrote a function that populates a dictionary with the dataframes and also takes the min and max dates for each df. I store all those min and max dates in two other dictionaries, DatesMax and DatesMin, and then take the min and the max of those two dictionaries to get the max and min dates that will be used as the filter values on all the dataframes.
The function below works: it gets the min and max dates of multiple dataframes and returns them in a dictionary named DatesMinMax.
def MinMaxDates(FileName):
    DatesMax = {}; DatesMin = {}
    DatesMinMax = {}; stocks = {}
    with open(FileName) as file_object:
        Current_indicators = file_object.read()
        tickers = Current_indicators.split('\n')
        for i in tickers:
            a = '/' in i
            if a == True:
                x = i.find("/") + 1
                df = pd.read_csv(str(i[x:]) + '_data.csv')
                stocks[i] = df
                maxDate = max(df.D)
                minDate = min(df.D)
                DatesMax[i] = maxDate
                DatesMin[i] = minDate
            else:
                df = pd.read_csv(i + '_data.csv')
                stocks[i] = df
                maxDate = max(df.D)
                minDate = min(df.D)
                DatesMax[i] = maxDate
                DatesMin[i] = minDate
    x = min(DatesMax.values())
    y = max(DatesMin.values())
    DatesMinMax = {'MaxDate': x, 'MinDate': y}
    return DatesMinMax

print DatesMinMax
# {'MinDate': '2012-02-08', 'MaxDate': '2017-01-20'}
Question
Now I will have to loop over all the dataframes in the dict named Stocks to filter their date columns. It seems inefficient to loop over everything again, but I can't think of any other way to apply the filter.
Actually, you may not need to capture the min and max (since 2016-12-30 < 2017-01-20) for later filtering; simply run a full inner join merge across all dataframes on the 'D' (Date) column.
Consider doing so with a chain merge, which ensures equal lengths across all dataframes, and then slice the resulting master dataframe by ticker columns to build the stocks dictionary. Of course, you can also use the wide master dataframe for analysis:
import pandas as pd
from functools import reduce  # on Python 3, reduce must be imported

with open(FileName) as file_object:
    Current_indicators = file_object.read()
    tickers = Current_indicators.split('\n')

# DATA FRAME LIST BUILD
dfs = []
for i in tickers:
    if '/' in i:
        x = i.find("/") + 1
        df = pd.read_csv(str(i[x:]) + '_data.csv')
    else:
        df = pd.read_csv(i + '_data.csv')
    # PREFIX ALL NON-DATE COLS WITH TICKER PREFIX
    df.columns = [col if col == 'D' else i + '_' + str(col) for col in df.columns]
    dfs.append(df)

# CHAIN MERGE (INNER JOIN) ACROSS ALL DFS
masterdf = reduce(lambda left, right: pd.merge(left, right, on=['D']), dfs)

# DATA FRAME DICT BUILD
stocks = {}
for i in tickers:
    # SLICE CURRENT TICKER COLUMNS
    df = masterdf[['D'] + [col for col in masterdf.columns if col.startswith(i + '_')]]
    # REMOVE TICKER PREFIXES
    df.columns = [col.replace(i + '_', '') for col in df.columns]
    stocks[i] = df
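Every frame in stocks now covers the same date range, so a quick sanity check is just a dictionary lookup ('AAPL' below is a placeholder ticker):
aapl = stocks['AAPL']                    # placeholder; use one of your ticker names
print(aapl['D'].min(), aapl['D'].max())  # identical bounds for every ticker after the inner join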
