Create unique dataframe - python

For each city (here: NY, Chicago) I have 3 CSV files with 2 columns, like this:
file 1 : ID, 20101201
file 2 : ID, 20101202
file 3 : ID, 20101203
Each file name is like this : "Chicago_ID_20101201.csv"
The 2nd column's name represents a date in the format YYYYMMDD.
I want to create a single file for each city with a dataframe containing 4 columns: ID plus the 3 columns referring to the dates in these files.
cities = ["NY","Chicago"]
dates = ["20101201", "20101202","20101203"]
for city in cities:
df = pd.DataFrame()
for date in dates:
file_name = f'{city}_ID_{date}.csv'
df[date] = pd.read_csv('[...]')
print(df[date])
Also, I would like to know if there is a way to avoid giving the list of dates, in case I want to do this for an entire month.
Thanks

Use pathlib:
import pandas as pd
import pathlib
import collections

# the path to your csv files
DATA_DIR = pathlib.Path("cities")

cities = collections.defaultdict(list)

# Collect data
for file in DATA_DIR.glob('*_ID_*.csv'):
    city = file.stem.split('_')[0]
    df = pd.read_csv(file, dtype=object).drop_duplicates('ID')
    cities[city].append(df.set_index('ID'))

# Build city files
for city in cities:
    df = pd.concat(cities[city], axis=1).reset_index()
    df.to_excel(DATA_DIR / f'{city}.xlsx', index=False)
Now you have two files Chicago.xlsx and NY.xlsx.
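Because glob discovers every matching file, this also answers the second question: no hard-coded date list is needed. To restrict a run to a single month you could tighten the pattern (a small sketch reusing DATA_DIR from above; the month string is just an example):
# pick up only December 2010 files, e.g. Chicago_ID_20101201.csv
for file in DATA_DIR.glob('*_ID_201012*.csv'):
    print(file.name)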

You can read each file into a dataframe, store them in a list with ID set as the index, and concatenate them to get one ID column and three date columns:
cities = ["NY","Chicago"]
dates = ["20101201", "20101202","20101203"]
for city in cities:
df_list=[]
for date in dates:
file_name = f'{city}_ID_{date}.csv'
df_list.append(pd.read_csv(file_name, index_col='ID'))
df = pd.concat(df_list, axis=1)
print(f'This is the dataframe for {city}', df)
For your second question, you can create any date range using pandas date_range:
pd.date_range(start="20101101", end="20101201", freq='D').strftime('%Y%m%d')
Output:
Index(['20101101', '20101102', '20101103', '20101104', '20101105', '20101106',
'20101107', '20101108', '20101109', '20101110', '20101111', '20101112',
'20101113', '20101114', '20101115', '20101116', '20101117', '20101118',
'20101119', '20101120', '20101121', '20101122', '20101123', '20101124',
'20101125', '20101126', '20101127', '20101128', '20101129', '20101130',
'20101201'],
dtype='object')
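For instance, the generated index can stand in for the hard-coded dates list in the loop above (a sketch, assuming the same file-naming scheme as in the question):
import pandas as pd

# every day of November 2010 as YYYYMMDD strings
dates = pd.date_range(start="20101101", end="20101130", freq='D').strftime('%Y%m%d')

for city in ["NY", "Chicago"]:
    df_list = [pd.read_csv(f'{city}_ID_{d}.csv', index_col='ID') for d in dates]
    df = pd.concat(df_list, axis=1)
    print(f'This is the dataframe for {city}', df)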


parse xlsx file having merged cells using python or pyspark

I want to parse an xlsx file. Some of the cells in the file are merged and act as a header for the values underneath.
But I do not know what approach I should take to parse the file.
Should I convert the file from xlsx to JSON and then perform the pivoting or transformation of the dataset?
OR
Should I proceed with the xlsx format directly and try to read specific cell values? I believe this approach will not make the code scalable and dynamic.
I tried to parse the file and convert it to JSON, but it did not load all the records; unfortunately, it does not throw any exception.
from json import dumps
from xlrd import open_workbook

# load excel file
wb = open_workbook('/dbfs/FileStore/tables/filename.xlsx')
# get sheet by using sheet name
sheet = wb.sheet_by_name('Input Format')
# get total rows
total_rows = sheet.nrows
# get total columns
total_columns = sheet.ncols

# convert each row of the sheet into a dictionary and append to a list
lst = []
for i in range(0, total_rows):
    row = {}
    for j in range(0, total_columns):
        if i + 1 < total_rows:
            column_name = sheet.cell(rowx=0, colx=j)
            row_data = sheet.cell_value(rowx=i + 1, colx=j)
            row.update(
                {
                    column_name.value: row_data
                }
            )
    if len(row):
        lst.append(row)

# convert into json
json_data = dumps(lst)
print(json_data)
After executing the above code I received the following type of output:
{
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FELIX PARTY.MIX",
    "": 2.9969042460942
},
{
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FRISKIES ESTERILIZADOS",
    "": 2.0046260994622
},
Once the data is in good shape, Spark on Databricks will be used for the transformation.
I tried multiple approaches but failed :(
Hence I am seeking help from the community.
For more clarity on the question I have added sample input/output screenshots below.
Input dataset: (screenshot)
Expected output: (screenshot)
You can download the actual dataset and expected output from the following link:
Dataset
To get the month column as per the requirement, you can use the following code:
import pandas as pd

for_cols = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=2, nrows=1)
main_cols = [for_cols[req][0] for req in for_cols if type(for_cols[req][0]) == type('x')]  # getting main header column names
#print(main_cols)

for_dates = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=4, usecols="C:R")
dates = for_dates.columns.to_list()  # getting list of month names to be used
#print(dates)

pdf = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=4)  # reading the file without main headers
#pdf

# All the columns for e.g. 2021 Jan will be labeled differently, like 2021 Jan, 2021 Jan.1, 2021 Jan.2 and so on.
# The following code creates an array of arrays where each child array is used to create a new small
# dataframe. All these new dataframes are then combined into a single dataframe (union).
req_cols = []
for i in range(len(main_cols)):
    current_dates = ['Market', 'Product']
    if i != 0:
        for d in dates:
            current_dates.append(d + f'.{i}')
    else:
        current_dates.extend(dates)
    req_cols.append(current_dates)
print(req_cols)
# The following code combines the dataframes to remove the duplicated yyyy MMM columns. It also adds a column
# `stype` whose value identifies which main header column each month belongs to for each product.
mydf = pdf[req_cols[0]]
mydf['stype'] = main_cols[0]
#display(mydf)
for i in range(1, len(req_cols)):
    temp = pdf[req_cols[i]]
    #print(temp.columns)
    temp['stype'] = main_cols[i]
    # renaming columns, i.e., changing 2021 Jan.1 and such to just 2021 Jan
    rename_cols = {'Market': 'Market', 'Product': 'Product', 'stype': 'stype'}
    for j in req_cols[i][2:]:
        rename_cols[j] = j[:8]  # if j is 2021 Jan.3 then j[:8] gives the actual name (2021 Jan)
    #print(rename_cols)
    temp.rename(columns=rename_cols, inplace=True)
    mydf = pd.concat([mydf, temp])  # combining the child dataframes into the main dataframe
mydf
tp = mydf[['Market', 'Product', '2021 Jan', 'stype']]
req_df = tp.pivot(index=['Product', 'Market'], columns='stype', values='2021 Jan')  # now pivoting the `stype` column
req_df['month'] = ['2021 Jan'] * len(req_df)  # initialising the month column
req_df.reset_index(inplace=True)  # converting index columns to actual columns
req_df  # required data format for 2021 Jan

# Use the following code to get the required result: do the same separately for each of the dates, then combine into `req_df`.
for dt in dates[1:]:
    tp = mydf[['Market', 'Product', dt, 'stype']]
    tp1 = tp.pivot(index=['Product', 'Market'], columns='stype', values=dt)
    tp1['month'] = [dt] * len(tp1)
    tp1.reset_index(inplace=True)
    req_df = pd.concat([req_df, tp1])
display(req_df[req_df['Product'] != 'Nestle Purina'])  # selecting only data where product name is not Nestle Purina
To create a new column called Nestle Purina for one of the main columns (Penetration) you can use the following code:
nestle_purina = req_df[req_df['Product'] == 'Nestle Purina']  # where product name is Nestle Purina
b = req_df[req_df['Product'] != 'Nestle Purina']  # where product name is not Nestle Purina
a = b[['Product', 'Market', 'month', 'Penetration % (% of Households who bought a product atleast once in the given time period)']]  # selecting required columns along with main column Penetration
n = nestle_purina[['month', 'Penetration % (% of Households who bought a product atleast once in the given time period)']]  # selecting only required columns from the nestle_purina df

import numpy as np
a['Nestle Purina'] = np.nan  # creating an empty column to populate using the code below
for dt in dates:
    val = [i for i in n[n['month'] == dt]['Penetration % (% of Households who bought a product atleast once in the given time period)']]  # the corresponding Nestle Purina value for the Penetration column
    a.loc[a['month'] == dt, 'Nestle Purina'] = val[0]  # updating the `Nestle Purina` column from nan to the value extracted above
a
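As a more general way of dealing with merged header cells, pandas can also read two header rows at once and forward-fill the top level, since a merged cell's value only lands in its first column. This is only a sketch under assumptions: header=[2, 4] is a guess based on the skiprows values used above and would need adjusting to the actual sheet layout:
import pandas as pd

# read the merged main-header row and the month row together as a MultiIndex
raw = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', header=[2, 4])

# a merged cell's value appears only above its first month column;
# forward-fill so every month column is labeled with its main header
level0 = pd.Series([str(top) for top, bottom in raw.columns])
level0 = level0.mask(level0.str.startswith('Unnamed')).ffill()
raw.columns = pd.MultiIndex.from_arrays(
    [level0, [bottom for top, bottom in raw.columns]],
    names=['stype', 'month'])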

In Pandas, how to merge multiple CSV files with an unnamed date index

I have a bunch of files, all having the same format; note that the first column does not have a name.
USD_EUR USD_JPY USD_GBP USD_AUD USD_CAD USD_CHF USD_HKD
1/1/2000 0.995421063 102.2596058 0.618853275 1.535138364 1.454111089 1.597750348 7.767569182
1/2/2000 0.995421063 102.2596058 0.618853275 1.535138364 1.454111089 1.597750348 7.767569182
1/3/2000 0.991080278 101.8334985 0.619028741 1.520911794 1.444697721 1.589990089 7.792269574
1/4/2000 0.970402717 102.7462397 0.610965551 1.52130034 1.449393498 1.557787482 7.782726832
1/5/2000 0.964506173 103.5300926 0.609953704 1.521315586 1.453028549 1.548996914 7.776716821
1/6/2000 0.962649211 104.6592222 0.606661533 1.523681171 1.452733924 1.546784752 7.782345014
How do I load all of them into a dataframe with the date as index? Here is what I have:
files = glob.glob(f"./Data_Forex/*")
if ForexCache is None:
    ForexCache = []
    for file in files:
        filename = Path(file).stem
        df_fx = pd.read_csv(f"{file}")
        df_fx.iloc[:, 0] = df_fx.iloc[:, 0].apply(lambda x: datetime.strptime(x, "%Y-%m-%d"))
        df_fx.set_index(df_fx.index, inplace=True)
        ForexCache.append(df_fx)
    ForexCache = functools.reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True, how='outer'), ForexCache)
The result is a bunch of empty rows with the date index but no values, and all the columns are duplicated for each file, so the columns didn't get merged. What am I doing wrong?
Assuming all your files are in root_folder, you can get a DataFrame with the content of all your files, sorted by date, in this way:
import os
import pandas as pd

df = pd.concat([
    pd.read_csv(os.path.join(root_folder, filename), delim_whitespace=True, parse_dates=True, dayfirst=True)
    for filename in next(os.walk(root_folder))[2]
]).sort_index()
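As for what went wrong in the original loop: df_fx.set_index(df_fx.index, inplace=True) just reassigns the index that is already there, so every frame keeps its default RangeIndex and the merge aligns on row numbers rather than dates. A minimal fix, assuming the files are whitespace-separated as in the sample, is to let the unnamed first column become the index and parse it as dates:
import glob
import functools
import pandas as pd

ForexCache = []
for file in glob.glob("./Data_Forex/*"):
    # with one fewer header name than data columns, pandas makes the
    # unnamed first column the index automatically
    df_fx = pd.read_csv(file, delim_whitespace=True)
    df_fx.index = pd.to_datetime(df_fx.index)
    ForexCache.append(df_fx)

ForexCache = functools.reduce(
    lambda left, right: pd.merge(left, right, left_index=True,
                                 right_index=True, how='outer'),
    ForexCache)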

Pyspark dataframe join based on key,group by and max

I have two parquet files, which I load with spark.read. These 2 dataframes share a column named key, so I join them with:
df = df.join(df2, on=['key'], how='inner')
df's columns are ["key", "Duration", "Distance"] and df2's are ["key", "department id"]. In the end I want to print Duration, max(Distance), and department id, grouped by department id. What I have done so far is:
df.join(df.groupBy('departmentid').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
but I think it is too slow; is there a faster way to achieve my goal?
Thanks in advance.
EDIT: sample (first 2 lines of each file)
df:
369367789289,2015-03-27 18:29:39,2015-03-27 19:08:28,-73.975051879882813,40.760562896728516,-73.847900390625,40.732685089111328,34.8
369367789290,2015-03-27 18:29:40,2015-03-27 18:38:35,-73.988876342773438,40.77423095703125,-73.985160827636719,40.763439178466797,11.16
df1:
369367789289,1
369367789290,2
Each column is separated by ","; the first column in both files is my key, then I have timestamps, longitudes and latitudes. In the second file I have only the key and department id.
To create Distance I am using a function called formater. This is how I get my distance and duration:
df = df.filter("_c3!=0 and _c4!=0 and _c5!=0 and _c6!=0")
df = df.withColumn("_c0", df["_c0"].cast(LongType()))
df = df.withColumn("_c1", df["_c1"].cast(TimestampType()))
df = df.withColumn("_c2", df["_c2"].cast(TimestampType()))
df = df.withColumn("_c3", df["_c3"].cast(DoubleType()))
df = df.withColumn("_c4", df["_c4"].cast(DoubleType()))
df = df.withColumn("_c5", df["_c5"].cast(DoubleType()))
df = df.withColumn("_c6", df["_c6"].cast(DoubleType()))
df = df.withColumn('Distance', formater(df._c3,df._c5,df._c4,df._c6))
df = df.withColumn('Duration', F.unix_timestamp(df._c2) - F.unix_timestamp(df._c1))
And then, as I showed above:
df = df.join(vendors, on=['key'], how='inner')
df.registerTempTable("taxi")
df.join(df.groupBy('vendor').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
The output must be:
Distance Duration department id
grouped by id, getting only the row with max(Distance).
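A common way to speed up this greatest-row-per-group pattern is a window function, which avoids joining the dataframe back onto its own aggregate. A sketch, assuming the joined dataframe df has the columns key, Duration, Distance and departmentid:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank rows within each department by Distance and keep only the top row
w = Window.partitionBy('departmentid').orderBy(F.desc('Distance'))
(df.withColumn('rn', F.row_number().over(w))
   .filter(F.col('rn') == 1)
   .select('Duration', 'Distance', 'departmentid')
   .show())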

Pandas create new column every time function runs

This is data.csv:
tickers = ['ACOR', 'ACM', 'ACLS', 'ACND', 'ACMR']
stats = ['mkt_cap', 'price', 'change']
This code creates a csv file for each stat in the assets directory:
date = str(dt.date.today())
for stat in stats:
    df = pd.read_csv('data.csv')
    df.set_index('ticker', inplace=True)
    df = df.loc[tickers, ['{}'.format(stat)]]
    date = str(dt.date.today())
    df.rename(columns={'{}'.format(stat): date}, inplace=True)
    df.to_csv('assets/{}.csv'.format(stat))
Here is price.csv
ticker 2019/07/04
ACOR 7.42
ACM 37.33
... ...
The problem is I need a new column to be created every time this function is run, with the current date as the header. Data.csv gets updated every day and I would like to add new data into mkt_cap.csv, prices.csv and change.csv with the new date as the header. The updated prices.csv would look like:
ticker 2019/07/04 2019/07/05
ACOR 7.42 XXX
ACM 37.33 XXX
... ...
EDIT:
date = str(dt.date.today())
for stat in stats:
    df = pd.read_csv('data.csv')
    df.set_index('ticker', inplace=True)
    df = df.loc[tickers, ['{}'.format(stat)]]
    date = str(dt.date.today())
    df.rename(columns={'{}'.format(stat): date}, inplace=True)
    df.to_csv('assets/{}.csv'.format(stat))

for col in stats.columns:
    stats["{}-{}".format(dt.date.today(), col)] = stats[col]

dataframes = []
for datapoint in stats.columns[-5:-1]:
    dataframes.append(stats[[datapoint, "ticker"]])

for dff in dataframes:
    dff.to_csv('assets/{}.csv'.format(dff.columns[1]))
import pandas as pd
import datetime as dt

list1 = []
for i in range(0, 10):
    list1.append(i)

df = pd.DataFrame()
df["col1"] = list1
df['col2'] = df['col1'] + 5

def new_col(df):
    df[dt.datetime.now()] = df['col1'] + df['col2']
    return df

new_col(df)
This will create a new column whenever the function is called, named with the datetime at which it ran. I am not entirely sure what arithmetic you want in the new column, but this should do the trick as far as creating it.
for col in acor.columns:  # or you could just use your stats list
    acor["{}-{}".format(dt.datetime.now(), col)] = acor[col]

# separate into individual dataframes
dataframes = []
for datapoint in acor.columns[-5:-1]:
    dataframes.append(acor[[datapoint, "timestamp"]])  # you probably want to replace "timestamp" with "symbol" or "ticker"

# finally, save the dataframes by date and stat
for dff in dataframes:
    dff.to_csv("{}.csv".format(dff.columns[1]))

How do I filter columns of multiple DataFrames stored in a dictionary in an efficient way?

I am working with stock data, and I want my data sets to have equal lengths when performing certain types of analysis.
Problem
If I load data for Apple I will get daily data since 1985, but if I load data for a Natural Gas ETF it might only go back as far as 2012. I now want to filter Apple to only show history going back to 2012. The same goes for the end date: some of my datasets may not be up to date, e.g. the Apple data ranges from 1985 to 1-20-17 while the Natural Gas ETF data ranges from 2012 to 12-23-16, so I also want another filter that sets the max date. My Apple dataset is then filtered to dates ranging between 2012 and 12-23-16, and my datasets are equal.
Approach
I have a dictionary called Stocks which stores all of my dataframes. All the dataframes have a column named D, which is the Date column.
I wrote a function that populates a dictionary with the dataframes and also takes the min and max dates for each df. I store all those min and max dates in two other dictionaries, DatesMax and DatesMin, and then take the min and the max of those two dictionaries to get the max and min dates that will be used as the filter values on all the dataframes.
The function below works; it gets the min and max dates of multiple dataframes and returns them in a dictionary named DatesMinMax.
def MinMaxDates(FileName):
    DatesMax = {}; DatesMin = {}
    DatesMinMax = {}; stocks = {}

    with open(FileName) as file_object:
        Current_indicators = file_object.read()
        tickers = Current_indicators.split('\n')

    for i in tickers:
        a = '/' in i
        if a == True:
            x = i.find("/") + 1
            df = pd.read_csv(str(i[x:]) + '_data.csv')
            stocks[i] = df
            maxDate = max(df.D)
            minDate = min(df.D)
            DatesMax[i] = maxDate
            DatesMin[i] = minDate
        else:
            df = pd.read_csv(i + '_data.csv')
            stocks[i] = df
            maxDate = max(df.D)
            minDate = min(df.D)
            DatesMax[i] = maxDate
            DatesMin[i] = minDate

    x = min(DatesMax.values())
    y = max(DatesMin.values())
    DatesMinMax = {'MaxDate': x, 'MinDate': y}
    return DatesMinMax

print DatesMinMax
# {'MinDate': '2012-02-08', 'MaxDate': '2017-01-20'}
Question
Now I will have to loop over all the dataframes in the dict named Stocks to filter their date columns. It seems inefficient to loop again, but I can't think of any other way to apply the filter.
Actually, you may not need to capture min and max for later filtering at all (e.g. since 2016-12-23 < 2017-01-20); simply run a full inner join merge across all dataframes on the 'D' (Date) column.
Consider doing so with a chain merge, which ensures equal lengths across all dataframes, and then slice the resulting master dataframe by ticker columns to build the Stocks dictionary. Of course, you can also use the wide master dataframe directly for analysis:
from functools import reduce
import pandas as pd

with open(FileName) as file_object:
    Current_indicators = file_object.read()
    tickers = Current_indicators.split('\n')

# DATA FRAME LIST BUILD
dfs = []
for i in tickers:
    if '/' in i:
        x = i.find("/") + 1
        df = pd.read_csv(str(i[x:]) + '_data.csv')
    else:
        df = pd.read_csv(i + '_data.csv')
    # PREFIX ALL NON-DATE COLS WITH TICKER PREFIX
    df.columns = [col if col == 'D' else i + '_' + str(col) for col in df.columns]
    dfs.append(df)

# CHAIN MERGE (INNER JOIN) ACROSS ALL DFS
masterdf = reduce(lambda left, right: pd.merge(left, right, on=['D']), dfs)

# DATA FRAME DICT BUILD
stocks = {}
for i in tickers:
    # SLICE CURRENT TICKER COLUMNS
    df = masterdf[['D'] + [col for col in masterdf.columns if i in col]]
    # REMOVE TICKER PREFIXES
    df.columns = [col.replace(i + '_', '') for col in df.columns]
    stocks[i] = df
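As a quick sanity check (a hypothetical usage sketch reusing masterdf and stocks from above), every sliced frame should now cover the same dates:
# the inner join guarantees one shared date range across all tickers
print(masterdf['D'].min(), masterdf['D'].max())
print({ticker: len(frame) for ticker, frame in stocks.items()})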
