parse xlsx file having merged cells using python or pyspark - python

I want to parse an xlsx file. Some of the cells in the file are merged and act as headers for the values underneath.
I do not know which approach to choose to parse the file.
Should I convert the file from xlsx to JSON and then perform the pivoting or transformation of the dataset?
OR
Should I stay with the xlsx format and read specific cell values? I believe this approach would not make the code scalable and dynamic.
I tried to parse the file and convert it to JSON, but it did not load all the records. Unfortunately, it does not throw any exception.
from json import dumps
from xlrd import open_workbook
# load excel file
wb = open_workbook('/dbfs/FileStore/tables/filename.xlsx')
# get sheet by using sheet name
sheet = wb.sheet_by_name('Input Format')
# get total rows
total_rows = sheet.nrows
# get total columns
total_columns = sheet.ncols
# convert each row of the sheet into a dictionary and append it to the list
lst = []
for i in range(0, total_rows):
    row = {}
    for j in range(0, total_columns):
        if i + 1 < total_rows:
            column_name = sheet.cell(rowx=0, colx=j)
            row_data = sheet.cell_value(rowx=i+1, colx=j)
            row.update(
                {
                    column_name.value: row_data
                }
            )
    if len(row):
        lst.append(row)
# convert into json
json_data = dumps(lst)
print(json_data)
After executing the above code, I received output of the following form:
{
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FELIX PARTY.MIX",
    "": 2.9969042460942
},
{
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FRISKIES ESTERILIZADOS",
    "": 2.0046260994622
},
Once the data is in good shape, Spark on Databricks should be used for the transformation.
I tried multiple approaches but failed :(
Hence I am seeking help from the community.
For more clarity, I attached screenshots of a sample input dataset and the expected output (images omitted here), together with a download link for the actual dataset and the expected output.

To get the month column as required, you can use the following code:
import pandas as pd
# read only the row that holds the merged main headers
for_cols = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=2, nrows=1)
# keep the string-valued cells: these are the main header column names
main_cols = [for_cols[req][0] for req in for_cols if isinstance(for_cols[req][0], str)]
#print(main_cols)
for_dates = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=4, usecols="C:R")
dates = for_dates.columns.to_list()  # list of month names to be used
#print(dates)
pdf = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=4)  # read the file without the main headers
#pdf
# Repeated column names such as 2021 Jan are labelled 2021 Jan, 2021 Jan.1, 2021 Jan.2 and so on.
# The following code builds a list of lists; each child list selects the columns of one small dataframe.
# All of these small dataframes are later combined into a single dataframe (union).
req_cols = []
for i in range(len(main_cols)):
    current_dates = ['Market', 'Product']
    if i != 0:
        for d in dates:
            current_dates.append(d + f'.{i}')
    else:
        current_dates.extend(dates)
    req_cols.append(current_dates)
print(req_cols)
# Combine the dataframes to remove the multiple yyyy MMM columns. A column `stype` is also added;
# it identifies the main header column that each month value belongs to for each product.
mydf = pdf[req_cols[0]].copy()  # copy so that adding a column does not act on a view of pdf
mydf['stype'] = main_cols[0]
#display(mydf)
for i in range(1, len(req_cols)):
    temp = pdf[req_cols[i]].copy()
    #print(temp.columns)
    temp['stype'] = main_cols[i]
    # rename columns, i.e. change 2021 Jan.1 and similar to just 2021 Jan
    rename_cols = {'Market': 'Market', 'Product': 'Product', 'stype': 'stype'}
    for j in req_cols[i][2:]:
        rename_cols[j] = j[:8]  # if j is 2021 Jan.3, j[:8] gives the actual name (2021 Jan)
    #print(rename_cols)
    temp.rename(columns=rename_cols, inplace=True)
    mydf = pd.concat([mydf, temp])  # append the child dataframe to the main dataframe
mydf
tp = mydf[['Market','Product','2021 Jan','stype']]
req_df = tp.pivot(index=['Product','Market'], columns='stype', values='2021 Jan')  # pivot the `stype` column
req_df['month'] = ['2021 Jan']*len(req_df)  # initialise the month column
req_df.reset_index(inplace=True)  # convert the index columns to actual columns
req_df  # required data format for 2021 Jan
# Use the following code to get the required result: repeat the pivot for each remaining date and append it to `req_df`.
for dt in dates[1:]:
    tp = mydf[['Market','Product',dt,'stype']]
    tp1 = tp.pivot(index=['Product','Market'], columns='stype', values=dt)
    tp1['month'] = [dt]*len(tp1)
    tp1.reset_index(inplace=True)
    req_df = pd.concat([req_df,tp1])
display(req_df[(req_df['Product'] != 'Nestle Purina')])  # select only rows where the product name is not Nestle Purina
To create a new column called Nestle Purina for one of the main columns (Penetration), you can use the following code:
nestle_purina = req_df[(req_df['Product'] == 'Nestle Purina')]  # rows where the product name is Nestle Purina
b = req_df[(req_df['Product'] != 'Nestle Purina')]  # rows where the product name is not Nestle Purina
a = b[['Product','Market','month','Penetration % (% of Households who bought a product atleast once in the given time period)']].copy()  # required columns along with the main column Penetration
n = nestle_purina[['month','Penetration % (% of Households who bought a product atleast once in the given time period)']]  # only the required columns from the nestle_purina df
import numpy as np
a['Nestle Purina'] = np.nan  # create an empty column to populate below
for dt in dates:
    val = [i for i in n[(n['month'] == dt)]['Penetration % (% of Households who bought a product atleast once in the given time period)']]  # the corresponding Nestle Purina value for the Penetration column
    a.loc[a['month'] == dt, 'Nestle Purina'] = val[0]  # update the `Nestle Purina` column from nan to the value extracted above
a
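As an alternative to reading the sheet several times, the merged headers can be spread across their columns with a forward fill, and the months can then be moved into rows with stack. The following is only a minimal sketch under assumptions: the frame `raw` is a hypothetical stand-in for what `pd.read_excel(..., header=None)` might return, the group names and all values are made up, and a merged cell is assumed to carry its value only in its first column.

```python
import pandas as pd

# Hypothetical raw sheet: row 0 holds the merged group headers (value only in
# the first cell of each merged range), row 1 holds the per-column labels.
raw = pd.DataFrame(
    [
        [None, None, "Penetration", None, "Volume", None],
        ["Market", "Product", "2021 Jan", "2021 Feb", "2021 Jan", "2021 Feb"],
        ["ES", "FELIX", 2.9, 3.1, 10.0, 11.0],
        ["ES", "FRISKIES", 2.0, 2.2, 7.0, 8.0],
    ]
)

groups = raw.iloc[0].ffill()  # spread each merged header over its columns
labels = raw.iloc[1]          # month labels (and the two id column names)
data = raw.iloc[2:].reset_index(drop=True)

ids = data.iloc[:, :2].set_axis(["Market", "Product"], axis=1)
values = data.iloc[:, 2:].astype(float)
# Two-level column header: (main header, month)
values.columns = pd.MultiIndex.from_arrays(
    [groups.iloc[2:], labels.iloc[2:]], names=["stype", "month"]
)
values.index = pd.MultiIndex.from_frame(ids)

# Move the month level into the rows; each main header stays a column.
result = values.stack("month").reset_index()
print(result)
```

The result has one row per Market, Product and month, with one column per main header, which is the same shape the loop over `dates` above produces.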

Related

How to add a new row with new header information in same dataframe

I have written code to retrieve JSON data from a URL. It works fine: I give the start and end date, and it loops through the date range and appends everything to a dataframe.
The columns are populated with the sensors from the JSON data and their corresponding values, hence the column names look like sensor_1. When I request the data from the URL, it sometimes happens that there are new sensors while old ones are switched off and no longer deliver data, and the number of columns often changes. In that case my code just adds new columns.
What I want, instead of new columns, is a new header row in the ongoing dataframe.
What I currently get with my code:
datetime;sensor_1;sensor_2;sensor_3;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-01;23.2;43.5;45.2;NaN;NaN;NaN;NaN;NaN;
2023-01-02;13.2;33.5;55.2;NaN;NaN;NaN;NaN;NaN;
2023-01-03;26.2;23.5;76.2;NaN;NaN;NaN;NaN;NaN;
2023-01-04;NaN;NaN;NaN;75;12;75;93;123;
2023-01-05;NaN;NaN;NaN;23;31;24;15;136;
2023-01-06;NaN;NaN;NaN;79;12;96;65;72;
What I want:
datetime;sensor_1;sensor_2;sensor_3;
2023-01-01;23.2;43.5;45.2;
2023-01-02;13.2;33.5;55.2;
2023-01-03;26.2;23.5;76.2;
datetime;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-04;75;12;75;93;123;
2023-01-05;23;31;24;15;136;
2023-01-06;79;12;96;65;72;
My loop to retrieve the data:
import datetime
import json
import numpy as np
import pandas as pd
import requests
start_date = datetime.datetime(2023,1,1,0,0)
end_date = datetime.datetime(2023,1,6,0,0)
sensor_data = pd.DataFrame()
while start_date < end_date:
    q = 'url'
    r = requests.get(q)
    j = json.loads(r.text)
    sub_data = pd.DataFrame()
    if 'result' in j:
        # named timestamps so that the datetime module is not shadowed
        timestamps = pd.to_datetime(np.array(j['result']['data'])[:,0])
        sensors = np.array(j['result']['sensors'])
        data = np.array(j['result']['data'])[:,1:]
        df_new = pd.DataFrame(data, index=timestamps, columns=sensors)
        sub_data = pd.concat([sub_data, df_new])
    sensor_data = pd.concat([sensor_data, sub_data])
    start_date += datetime.timedelta(days=1)
If two DataFrames will do for you, then you can simply split by column names:
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']]
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']]
Note the double brackets [[ ]], which select a list of columns.
Then use .dropna() on each frame to drop the NaN rows.
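If the sensor groups are not known in advance, the same split can be automated. The following is a minimal sketch, not the original poster's code: it assumes a hypothetical frame `df` with `datetime` as its index and made-up values, groups consecutive rows that have data in the same set of columns, and writes each group with its own header line.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample: two sensors stop reporting and two new ones appear.
df = pd.DataFrame(
    {
        "sensor_1": [23.2, 13.2, np.nan, np.nan],
        "sensor_2": [43.5, 33.5, np.nan, np.nan],
        "new_sensor_8": [np.nan, np.nan, 75.0, 23.0],
        "new_sensor_9": [np.nan, np.nan, 12.0, 31.0],
    },
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04", "2023-01-05"]),
)
df.index.name = "datetime"

# Signature of each row: the names of the columns that actually hold data.
mask = df.notna().to_numpy()
signature = pd.Series([",".join(df.columns[row]) for row in mask], index=df.index)
# A new block starts whenever the signature differs from the previous row.
block_id = (signature != signature.shift()).cumsum()

buffer = io.StringIO()  # stands in for an output file
blocks = []
for _, block in df.groupby(block_id):
    block = block.dropna(axis=1, how="all")  # keep only this block's sensors
    blocks.append(block)
    block.to_csv(buffer, sep=";")  # each block is written with its own header
text = buffer.getvalue()
print(text)
```

Note that a single missing reading in an otherwise active sensor would also start a new block in this sketch, so real data may need a more tolerant signature.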

Create unique dataframe

For each city (here: NY, Chicago) I have 3 CSV files with 2 columns each, like this:
file 1: ID, 20101201
file 2: ID, 20101202
file 3: ID, 20101203
Each file name looks like this: "Chicago_ID_20101201.csv"
The name of the 2nd column represents a date in the format YYYYMMDD.
I want to create a single file for each city with a dataframe containing 4 columns: ID and 3 other columns, one for each date in these files.
cities = ["NY","Chicago"]
dates = ["20101201", "20101202","20101203"]
for city in cities:
    df = pd.DataFrame()
    for date in dates:
        file_name = f'{city}_ID_{date}.csv'
        df[date] = pd.read_csv('[...]')
        print(df[date])
Also, I would like to know whether there is a way to avoid giving the list of dates, in case I want to do this for an entire month.
Thanks
Use pathlib:
import pandas as pd
import pathlib
import collections
# the path to your csv files
DATA_DIR = pathlib.Path("cities")
cities = collections.defaultdict(list)
# Collect data
for file in DATA_DIR.glob('*_ID_*.csv'):
    city = file.stem.split('_')[0]
    df = pd.read_csv(file, dtype=object).drop_duplicates('ID')
    cities[city].append(df.set_index('ID'))
# Build city files
for city in cities:
    df = pd.concat(cities[city], axis=1).reset_index()
    df.to_excel(DATA_DIR / f'{city}.xlsx', index=False)
Now you have two files, Chicago.xlsx and NY.xlsx.
You can read each file into a dataframe with ID as the index, store the dataframes in a list, and concatenate them to get one ID column and three date columns:
import pandas as pd
cities = ["NY","Chicago"]
dates = ["20101201", "20101202","20101203"]
for city in cities:
    df_list = []
    for date in dates:
        file_name = f'{city}_ID_{date}.csv'
        df_list.append(pd.read_csv(file_name, index_col='ID'))
    df = pd.concat(df_list, axis=1)
    print(f'This is the dataframe for {city}', df)
For your second question, you can create any date range using pandas date_range:
pd.date_range(start="20101101", end="20101201", freq='D').strftime('%Y%m%d')
Output:
Index(['20101101', '20101102', '20101103', '20101104', '20101105', '20101106',
'20101107', '20101108', '20101109', '20101110', '20101111', '20101112',
'20101113', '20101114', '20101115', '20101116', '20101117', '20101118',
'20101119', '20101120', '20101121', '20101122', '20101123', '20101124',
'20101125', '20101126', '20101127', '20101128', '20101129', '20101130',
'20101201'],
dtype='object')
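The two ideas can be combined: generate the date strings with date_range and concatenate the per-date frames on ID. This is a minimal sketch under assumptions: the dictionary `fake_files` is a hypothetical in-memory stand-in for the CSV files on disk, and its values are made up.

```python
import io

import pandas as pd

# Hypothetical file contents keyed by file name (stand-ins for files on disk).
fake_files = {
    "Chicago_ID_20101201.csv": "ID,20101201\n1,10\n2,20\n",
    "Chicago_ID_20101202.csv": "ID,20101202\n1,11\n2,21\n",
    "Chicago_ID_20101203.csv": "ID,20101203\n2,22\n3,32\n",
}

# Generate the date part of the file names instead of listing it by hand.
dates = pd.date_range(start="2010-12-01", end="2010-12-03", freq="D").strftime("%Y%m%d")

frames = [
    pd.read_csv(io.StringIO(fake_files[f"Chicago_ID_{date}.csv"]), index_col="ID")
    for date in dates
]
# Align on ID; IDs missing from a file get NaN in that date column.
chicago = (
    pd.concat(frames, axis=1)
    .sort_index()
    .rename_axis("ID")
    .reset_index()
)
print(chicago)
```

With real files, `io.StringIO(fake_files[...])` would be replaced by the file path.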

Using pandas to categories text data in one column and have corresponding categories stated in the next column

My Excel spreadsheet currently looks like this after inserting the new column "Expense" with the following code:
import pandas as pd
df = pd.read_csv(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.csv")
df.insert(2, "Expense", " ")  # insert() modifies df in place and returns None
df.to_excel(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.xlsx", index=None, header=True)
Because the Description column contains the word "DRAKES", I can categorize that expense as "Personal", which should appear in the Expense column next to it.
Similarly, the next one down contains "Optus" and is categorized as a mobile-related expense, so the word "Phone" should appear in the Expense column.
I have tried searching on Google and YouTube, but I just can't seem to find an example of something like this.
Thanks for your help.
You can define a function that holds all these rules and simply apply it. For example:
def rules(x):
    if "DRAKES" in x["Description"]:
        return "Personal"
    if "OPTUS" in x["Description"]:
        return "Phone"
    return ""  # no rule matched
df["Expense"] = df.apply(rules, axis=1)
I have solved my problem by using a while loop. I tried to use the method in quest's answer, but I most likely didn't use it properly and kept getting an error. So I used a while loop to search through each individual cell in the "Description" column and categorize it in the same row of the "Expenses" column.
My solution using a while loop:
import pandas as pd
df = pd.read_csv("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.csv")
df.insert(2, "Expenses", "")
description = "Description"
expense = "Expenses"
transfer = "Transfer"
i = -1  # because I wanted Python to start searching from index 0
while i < 296:  # 296 is the row where my data ends
    i = i + 1
    if "Drakes".upper() in df.loc[i, description]:
        df.loc[i, expense] = "Personal"
    if "Optus".upper() in df.loc[i, description]:
        df.loc[i, expense] = "Phone"
df.sort_values(by=["Expenses"], inplace=True)
df.to_excel("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.xlsx", index=False)
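The same categorization can also be done without an explicit row loop and without hard-coding the last row number. A minimal sketch, assuming a hypothetical `rules` mapping and made-up descriptions; only the two keywords come from the question.

```python
import pandas as pd

# Made-up sample rows standing in for the CSV data.
df = pd.DataFrame(
    {"Description": ["DRAKES SUPERMARKET 123", "OPTUS MOBILE BILL", "SALARY"]}
)

# Hypothetical keyword -> category rules; extend as needed.
rules = {"DRAKES": "Personal", "OPTUS": "Phone"}

df["Expense"] = ""
for keyword, category in rules.items():
    # Case-insensitive plain-text match; missing descriptions count as no match.
    matches = df["Description"].str.contains(keyword, case=False, regex=False, na=False)
    df.loc[matches, "Expense"] = category
print(df)
```

If a description matches several keywords, the rule listed last wins in this sketch.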

Twitterscraper: Adding tweet country info to scraped dataframe

I am using twitterscraper from https://github.com/taspinar/twitterscraper to scrape around 20k tweets created since 2018. Tweet locations are not extracted by the default settings. Nevertheless, a search for tweets written from a location can be done with advanced queries placed within quotes, e.g. "#hashtagofinterest near:US"
So I am thinking of looping through a list of country codes (alpha-2) to filter the tweets from each country and add the country info to my search result. Initial attempts were made on small samples of tweets from the past 10 days.
import datetime as dt
import pandas as pd
from twitterscraper import query_tweets
#set arguments
begin_date = dt.date(2020,4,1)
end_date = dt.date(2020,4,11)
lang = 'en'
#define queries (alpha_2 is my list of alpha-2 country codes)
queries = [(f'(#hashtagA OR #hashtagB near:{country})', country) for country in alpha_2]
#initiate queries
dfs = []
for query, country in queries[:10]: #trying on first 10 countries
    temp = query_tweets(query, begindate = begin_date, enddate = end_date, lang=lang)
    temp = pd.DataFrame(t.__dict__ for t in temp)
    temp["country"] = [country]*len(temp)
    dfs.append((temp, country))
I managed to add the country info as a new variable for each country df.
Part of the output (dfs and df): screenshots omitted.
However, I am stuck at combining the query results into one dataframe. pd.concat() does not work; it complains about 22 columns being passed while the passed data has 2 columns (screenshot of the unintended result omitted).
My intended result is a dataframe with a new country column added to the default 21 columns (22 columns in total; screenshot of the intended result omitted).
Since dfs is a list of tuples, with each tuple being (DataFrame, str), you only want to concatenate the first element of each tuple in dfs.
You may achieve this using:
concat_df = pd.concat([df for df, _ in dfs], ignore_index=True)
which creates a new list of only the DataFrames and concatenates those. I have added ignore_index=True so that the rows are re-indexed in the concatenated DataFrame.
Since the country is already stored in the DataFrame, you could also skip adding it to dfs and append only temp instead:
dfs = []
for query, country in queries[:10]: #trying on first 10 countries
    temp = query_tweets(query, begindate = begin_date, enddate = end_date, lang=lang)
    temp = pd.DataFrame(t.__dict__ for t in temp)
    temp["country"] = [country]*len(temp)
    dfs.append(temp)
concat_df = pd.concat(dfs, ignore_index=True)
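A possible variation is to let pd.concat attach the country label itself by passing a dictionary of frames. This is only a sketch with made-up frames; the dictionary `frames_by_country` and its column names are hypothetical and do not come from twitterscraper.

```python
import pandas as pd

# Hypothetical per-country results keyed by alpha-2 code.
frames_by_country = {
    "US": pd.DataFrame({"text": ["a", "b"], "likes": [1, 2]}),
    "DE": pd.DataFrame({"text": ["c"], "likes": [3]}),
}

combined = (
    pd.concat(frames_by_country, names=["country"])  # dict keys become an index level
    .reset_index(level="country")                     # turn that level into a column
    .reset_index(drop=True)                           # renumber the rows from 0
)
print(combined)
```

This avoids adding the country column inside the loop, at the cost of keeping the frames in a dictionary instead of a list.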

concat the strings of one column based on condition on other column

I have a data frame in which I want to remove duplicates in the column named "sample" and add the string information from the gene and status columns to new columns, as shown in the attached pictures.
Thank you so much in advance.
Below is the modified version of the data frame, where the gene entries in the rows are replaced by actual gene names.
Here, df is your pandas DataFrame.
def new_1(g):
    return ','.join(g.gene)
def new_2(g):
    return ','.join(g.gene + '-' + g.status)
new_1_data = df.groupby("sample").apply(new_1).to_frame(name="new_1")
new_2_data = df.groupby("sample").apply(new_2).to_frame(name="new_2")
new_data = pd.merge(new_1_data, new_2_data, on="sample")
new_df = pd.merge(df, new_data, on="sample").drop_duplicates("sample")
If you wish to renumber the rows from 0 afterwards, add
new_df = new_df.reset_index(drop=True)
Lastly, as you did not specify which of the duplicate rows to retain, I simply use the default behavior of pandas and drop all but the first occurrence.
Edit
I converted your example to the following CSV file (delimited by ',') which I will call "data.csv".
sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
srty,cat,gain
srty,cd23,gain
tygd,brac1,loss
tygd,brac2,gain
tygd,ras,loss
I load this data as
# Default delimiter is ','. Pass `sep` argument to specify delimiter.
df = pd.read_csv("data.csv")
Running the code above and printing the dataframe produces the output
   sample   gene  status            new_1                           new_2
0    ppar    p53    gain      p53,gata,nb      p53-gain,gata-gain,nb-loss
3    srty    nf1    gain     nf1,cat,cd23     nf1-gain,cat-gain,cd23-gain
6    tygd  brac1    loss  brac1,brac2,ras  brac1-loss,brac2-gain,ras-loss
This is exactly the expected output given in your example.
Note that the left-most column of numbers (0, 3, 6) is a remnant of the row index of the original dataframe, left over after the merge and the removal of duplicates. When you write this dataframe to a file, you can exclude it by setting index=False in new_df.to_csv(...).
Edit 2
I checked the CSV file you emailed me. You have a space after the word "gene" in the header of your CSV file.
Change the first line of your CSV file from
sample,gene ,status
to
sample,gene,status
Also, there are spaces in your entries. If you wish to remove them, you can use
# Strip spaces from entries. Only works if every entry is a string
df = df.applymap(lambda x: x.strip())
Might not be the most efficient solution, but this should get you there:
samples = []
genes = []
statuses = []
for s in set(df["sample"]):
    #grab unique samples
    samples.append(s)
    #get the genes for each sample and concatenate them
    g = df["gene"][df["sample"]==s].str.cat(sep=",")
    genes.append(g)
    #loop through the genes for the sample and get the statuses
    status = ''
    for gene in g.split(","):
        gene_status = df["status"][(df["sample"] == s) & (df["gene"] == gene)].to_string(index=False)
        status += gene
        status += "-"
        status += gene_status
        status += ','
    statuses.append(status)
#create new df
new_df = pd.DataFrame({'sample': samples,
                       'new': genes,
                       'new1': statuses})
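A more compact variant of the same idea uses a single groupby with named aggregation. This is a minimal sketch, assuming the column names from the sample CSV above; the helper column `gene_status` is introduced here for illustration only.

```python
import io

import pandas as pd

# A shortened copy of the sample data from the answer above.
csv_text = """sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
"""
df = pd.read_csv(io.StringIO(csv_text))

# Helper column (hypothetical name) holding "gene-status" per row.
df["gene_status"] = df["gene"] + "-" + df["status"]

summary = (
    df.groupby("sample", sort=False)
    .agg(new_1=("gene", ",".join), new_2=("gene_status", ",".join))
    .reset_index()
)
print(summary)
```

Unlike the loop above, this keeps the samples in their original order and leaves no trailing comma in the joined strings.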
