Data Extraction from multiple excel files in pandas dataframe - python

I'm trying to create a data ingestion routine to load data from multiple Excel files, each with multiple tabs and columns, into pandas DataFrames. The tab structure is the same in every Excel file. Each tab of each Excel file should become a separate data frame. As of now, I have created a list of data frames, one per Excel file, each holding the concatenated data from all of that file's tabs. But I'm trying to find a way to access each Excel file from a data structure, and each tab of that file as a separate data frame. Below is the current code. Any improvement would be appreciated! Please let me know if anything else is needed.
import os
import pandas as pd

# Assigning the path to the folder variable
folder = 'specified_path'
# Getting the list of files from the assigned path
excel_files = [file for file in os.listdir(folder)]
list_of_dfs = []
for file in excel_files:
    # Read all tabs (sheet_name=None) and concatenate them into one frame
    df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None), ignore_index=True)
    df['excelfile_name'] = file.split('.')[0]
    list_of_dfs.append(df)

I would propose to change the line
df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None), ignore_index=True)
to
df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None))
df.index = df.index.get_level_values(0)
df = df.reset_index().rename({'index': 'Tab'}, axis=1)
Without ignore_index=True, pd.concat keeps the sheet names as level 0 of a MultiIndex; the last two lines move that level into a regular Tab column.
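To illustrate the effect without an actual Excel file: `pd.read_excel(..., sheet_name=None)` returns a dict mapping sheet name to DataFrame, so a plain dict can stand in for it (a minimal sketch; the sheet names and values here are made up):

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(..., sheet_name=None),
# which returns a dict mapping sheet name -> DataFrame.
sheets = {
    "Sheet1": pd.DataFrame({"Reviewed": [43, 16], "Adjusted": [20, 8]}),
    "Sheet2": pd.DataFrame({"Reviewed": [8, 17], "Adjusted": [3, 3]}),
}

# Concatenating the dict keeps the sheet name as index level 0.
df = pd.concat(sheets)
df.index = df.index.get_level_values(0)
df = df.reset_index().rename({"index": "Tab"}, axis=1)
print(df)
```

Each row now carries its originating tab name in the Tab column.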

To create a separate dataframe for each tab (with duplicated content) in an Excel file, one could iterate over the unique index level 0 values and index with them:
df = pd.concat(pd.read_excel(filename, sheet_name=None))
list_of_dfs = []
for tab in df.index.get_level_values(0).unique():
    tab_df = df.loc[tab]
    list_of_dfs.append(tab_df)
For illustration, after reading an Excel file with 3 identical tabs and running the above code, here is the content of list_of_dfs:
[ Date Reviewed Adjusted
0 2022-07-11 43 20
1 2022-07-18 16 8
2 2022-07-25 8 3
3 2022-08-01 17 3
4 2022-08-15 14 6
5 2022-08-22 12 5
6 2022-08-29 8 4,
Date Reviewed Adjusted
0 2022-07-11 43 20
1 2022-07-18 16 8
2 2022-07-25 8 3
3 2022-08-01 17 3
4 2022-08-15 14 6
5 2022-08-22 12 5
6 2022-08-29 8 4,
Date Reviewed Adjusted
0 2022-07-11 43 20
1 2022-07-18 16 8
2 2022-07-25 8 3
3 2022-08-01 17 3
4 2022-08-15 14 6
5 2022-08-22 12 5
6 2022-08-29 8 4]
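As a side note, since `sheet_name=None` already returns a dict keyed by tab name, the per-tab DataFrames can also be taken from that dict directly, without concatenating and re-splitting (a sketch with a made-up dict standing in for the Excel file):

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(filename, sheet_name=None)
sheets = {
    "Tab1": pd.DataFrame({"Reviewed": [43], "Adjusted": [20]}),
    "Tab2": pd.DataFrame({"Reviewed": [16], "Adjusted": [8]}),
}

# Each dict value is already a separate per-tab DataFrame.
list_of_dfs = list(sheets.values())
# Or keep the tab names alongside:
named = {tab: tab_df for tab, tab_df in sheets.items()}
```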

Related

How to scrape zip files into a single dataframe in python

I am very new to web scraping and I am trying to understand how I can scrape all the zip files and regular files that are on this website. The end goal is to scrape all the data. I was originally thinking I could use pd.read_html, feed in a list of each link, and loop through each zip file.
I am very new to web scraping, so any help at all would be very useful. I have tried a few examples so far; please see the below code.
import pandas as pd
pd.read_html("https://www.omie.es/en/file-access-list?parents%5B0%5D=/&parents%5B1%5D=Day-ahead%20Market&parents%5B2%5D=1.%20Prices&dir=%20Day-ahead%20market%20hourly%20prices%20in%20Spain&realdir=marginalpdbc",match="marginalpdbc_2017.zip")
So this is what I would like the output to look like except each zip file would need to be its own data frame to work with/loop through. Currently, all it seems to be doing is downloading all the names of the zip files, not the actual data.
Thank you
To open a zip file and read the files inside it into a dataframe, you can use the following example:
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
zip_url = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_2017.zip"
dfs = []
with ZipFile(BytesIO(requests.get(zip_url).content)) as zf:
    for file in zf.namelist():
        df = pd.read_csv(
            zf.open(file),
            sep=";",
            skiprows=1,
            skipfooter=1,
            engine="python",
            header=None,
        )
        dfs.append(df)
final_df = pd.concat(dfs)
# print first 10 rows:
print(final_df.head(10).to_markdown(index=False))
Prints:
|    0 |   1 |   2 |   3 |     4 |     5 |   6 |
|-----:|----:|----:|----:|------:|------:|----:|
| 2017 |   1 |   1 |   1 | 58.82 | 58.82 | nan |
| 2017 |   1 |   1 |   2 | 58.23 | 58.23 | nan |
| 2017 |   1 |   1 |   3 | 51.95 | 51.95 | nan |
| 2017 |   1 |   1 |   4 | 47.27 | 47.27 | nan |
| 2017 |   1 |   1 |   5 | 46.9  | 45.49 | nan |
| 2017 |   1 |   1 |   6 | 46.6  | 44.5  | nan |
| 2017 |   1 |   1 |   7 | 46.25 | 44.5  | nan |
| 2017 |   1 |   1 |   8 | 46.1  | 44.72 | nan |
| 2017 |   1 |   1 |   9 | 46.1  | 44.22 | nan |
| 2017 |   1 |   1 |  10 | 45.13 | 45.13 | nan |

How to transfer some information from one CSV to another CSV using python

I have two CSV files
one large CSV file (Stock.csv) that contains all the information, and another that contains partial information (Sold.csv)
Example of the large CSV file(Stock.csv)
Item_No Price
1 20
2 10
3 9.99
4 11
5 13
6 11
7 7.99
The other CSV file contains only the sold items, listed by Item_No with no price, whose prices are now needed.
Example of the other CSV file(Sold.csv)
Item_No Price
1
4
7
As you can see, it only contains the Item_No. How can I add the price of each Item_No? (The file contains more than 30,000 items; doing it manually would take ages.)
Try:
df1 = pd.read_csv("Stock.csv", sep=r"\s+") # <-- adjust separator accordingly
df2 = pd.read_csv("Sold.csv", sep=r"\s+")
df2["Price"] = df2["Item_No"].map(df1.set_index("Item_No")["Price"])
print(df2)
df2.to_csv("output.csv", index=False)
Prints:
Item_No Price
0 1 20.00
1 4 11.00
2 7 7.99
and creates output.csv.
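An equivalent way to get the same result is a left merge (a sketch on the sample data from the question):

```python
import pandas as pd

stock = pd.DataFrame({"Item_No": [1, 2, 3, 4, 5, 6, 7],
                      "Price": [20, 10, 9.99, 11, 13, 11, 7.99]})
sold = pd.DataFrame({"Item_No": [1, 4, 7]})

# A left merge keeps every sold item and pulls its price from stock.
out = sold.merge(stock, on="Item_No", how="left")
print(out)
```

Unlike `.map`, a merge also works naturally when more than one column needs to be carried over.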

How to Open a Text File and Create an Array in Python

I have a text file called Orbit 1 and I need help opening it and then creating three separate arrays. I'm new to Python and have been having difficulty with this aspect. Here are the first few rows of my text file. There are 1112 rows including the header.
Year Month Day Hour Minute Second Millisecond Longitude Latitude Altitude
2019 3 17 5 55 55 0 108.8730074 50.22483151 412.6226898
2019 3 17 5 56 0 0 108.9895097 50.53642185 412.7368197
2019 3 17 5 56 5 0 109.1078294 50.8478274 412.850563
2019 3 17 5 56 10 0 109.2280101 51.15904424 412.9640113
2019 3 17 5 56 15 0 109.3500969 51.47006828 413.0772319
2019 3 17 5 56 20 0 109.4741362 51.78089533 413.1901358
2019 3 17 5 56 25 0 109.6001758 52.09152105 413.3025291
2019 3 17 5 56 30 0 109.728265 52.40194099 413.414457
2019 3 17 5 56 35 0 109.8584548 52.71215052 413.5259984
2019 3 17 5 56 40 0 109.9907976 53.02214489 413.6371791
I desire to open this text file to create three arrays called lat[N], long[N], and time[N] where N is the number of rows in the file. I ultimately want to be able to determine what the latitude, longitude, and time is at any point. For example, lat[0] should return 50.22483151 if working properly. In addition, for the time, I would need to convert to decimal hours and then create the array.
Essentially I need help with opening this text file I have and then creating the three arrays.
I've tried this method for opening the file, but I get stuck when trying to write the array and I think I may not be opening the file correctly.
import numpy as np
file_name = 'C:\\Users\\Saman\\OneDrive\\Documents\\Orbit 1.txt'
data = []
with open(file_name) as file:
    next(file)  # skip the header row
    for line in file:
        row = line.split()
        row = [float(x) for x in row]
        data.append(row)
The easiest way to solve your problem is to use Pandas:
import pandas as pd
df = pd.read_table('Orbit 1.txt', sep=r'\s+')
df['Longitude']
#0 108.873007
#1 108.989510
#2 109.107829
#3 109.228010
#4 109.350097
#5 109.474136
#6 109.600176
#7 109.728265
#8 109.858455
#9 109.990798
Once you get a Pandas DataFrame, you may want to use it for the rest of the data processing, too.
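From that DataFrame, the three requested arrays (with time converted to decimal hours) could be built like this (a sketch using an in-line sample in place of 'Orbit 1.txt'):

```python
import pandas as pd
from io import StringIO

# In-line stand-in for the first rows of 'Orbit 1.txt'
sample = """Year Month Day Hour Minute Second Millisecond Longitude Latitude Altitude
2019 3 17 5 55 55 0 108.8730074 50.22483151 412.6226898
2019 3 17 5 56 0 0 108.9895097 50.53642185 412.7368197
"""
df = pd.read_table(StringIO(sample), sep=r'\s+')

lat = df['Latitude'].to_numpy()
long = df['Longitude'].to_numpy()
# Decimal hours from the Hour/Minute/Second columns
time = (df['Hour'] + df['Minute'] / 60 + df['Second'] / 3600).to_numpy()
print(lat[0])  # -> 50.22483151
```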
Try this:
file_name = 'info.txt'
Lat = []
Long = []
Time = []
left_justified = lambda x: x + " " * (19 - len(x))
right_justified = lambda x: " " * (19 - len(x)) + x
with open(file_name) as file:
    next(file)
    for line in file:
        data = line.split()
        Lat.append(data[8])
        Long.append(data[7])
        hrs = int(data[3])
        minutes = int(data[4])
        secs = int(data[5])
        total_secs = secs + minutes * 60 + hrs * 3600
        Time.append(total_secs / 3600)
print(left_justified("Time"), left_justified("Lat"), left_justified("Long"))
for i in range(len(Lat)):
    print(left_justified(str(Time[i])), left_justified(Lat[i]), left_justified(Long[i]))

HTML table in pandas with single header row

I have the following dataframe:
ID mutex add atomic add cas add ys_add blocking ticket queued fifo
Cores
1 21.0 7.1 12.1 9.8 32.2 44.6
2 121.8 40.0 119.2 928.7 7329.9 7460.1
3 160.5 81.5 227.9 1640.9 14371.8 11802.1
4 188.9 115.7 347.6 1945.1 29130.5 15660.1
There is both a column index (ID) and a row index (Cores). When I use DataFrame.to_html(), I get a table with two header rows: one with ID followed by the column names, and a second containing only Cores. Instead, I'd like a table with a single header row, composed of all the column names (but without the column index name ID) and with the row index name Cores in that same header row.
I'm open to manipulating the dataframe prior to the to_html() call, or adding parameters to the to_html() call, but not messing around with the generated html.
Initial setup:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],
columns = ['attr_a', 'attr_b', 'attr_c', 'attr_c'])
df.columns.name = 'ID'
df.index.name = 'Cores'
df
ID attr_a attr_b attr_c attr_c
Cores
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Then set columns.name to 'Cores', and index.name to None. df.to_html() should then give you the output you want.
df.columns.name='Cores'
df.index.name = None
df.to_html()
Cores attr_a attr_b attr_c attr_c
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
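A quick check of the idea (a sketch; the column names are made up): after swapping the names, the rendered HTML has a single header row that starts with Cores.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['attr_a', 'attr_b', 'attr_c', 'attr_d'])
df.columns.name = 'ID'
df.index.name = 'Cores'

# Move the row index name into the columns' name slot
df.columns.name = 'Cores'
df.index.name = None

html = df.to_html()
header = html.split('</thead>')[0]
print(header.count('<tr'))  # a single header row
```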

Reading all csv files at particular folder, Merge them, and find the maximum value w.r.t. row interval

I have 120 csv files. They include IndexNo, date, EArray, temperature, etc.
The Index column varies from 1 to 8760.
I want to read all the csv files from a folder and merge them into a single file.
Once I have merged these files, each IndexNo will appear 120 times (i.e. IndexNo 1 will have 120 rows).
After this, I have to find the maximum value of EArray for each IndexNo (i.e. IndexNo 1 to 8760) and print that maximum EArray value's row.
import pandas as pd
import os
import glob

path = r'C:\Data_Input'  # use your path
all_files = glob.glob(path + "/*.csv")
# print(all_files)
li = []
for filename in all_files:
    df = pd.read_csv(filename, skiprows=10, names=None, engine='python', header=0, encoding='unicode_escape')
    df = df.assign(File_name=os.path.basename(filename).split('.')[0])
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True, sort=False)
frame = frame.dropna()
df = frame.assign(max_EArray=frame.groupby('IndexNo')['EArray'].transform('max'))
df_filtered = df[df['EArray'] == df['max_EArray']]
output = df_filtered.loc[df_filtered.ne(0).all(axis=1)].drop('max_EArray', axis=1)
print(output.shape)
output.to_csv('temp.csv')
Your task can be done quite easily using dask (instead of pure Pandas).
One of its advantages is that, out of the box, you can get
the name of the source file from which a particular row was read.
My solution is as follows:
Install dask (if you have not installed it yet).
Import dask.dataframe:
import dask.dataframe as dd
Define a function to reformat the DataFrame (called individually on
each "partial" DataFrame read from a particular .csv file):
def reformat(df):
    df.path = df.path.str.extract(r'/(\w+)\.\w+')
    return df[['IndexNo', 'EArray', 'path']]
Here you can use "normal" Pandas code. It also changes path,
stripping the directory part and leaving only the file name (without extension).
Define a function to get the "max" row from each group (after grouping
by IndexNo):
def getMax(grp):
    wrk = grp.reset_index(drop=True)
    ind = wrk.EArray.idxmax()
    return wrk.loc[ind, ['EArray', 'path']]
Run the actual processing:
ddf = dd.read_csv('EArray/*.csv', include_path_column=True)
ddf = ddf.map_partitions(reformat)
ddf = ddf.groupby('IndexNo').apply(getMax, meta={'EArray': 'i4', 'path': 'O'})
df = ddf.compute().sort_index().reset_index()
Description:
'EArray/*.csv' - specification of the bunch of source files.
I put all source files in a dedicated subfolder (EArray).
include_path_column=True - adds path column to the DataFrame, containing
full path of the file each row has been read from.
map_partitions(...) - call reformat function individually on each
"partial" DataFrame.
groupby(...) and apply(...) - generally, like in Pandas.
meta - additional argument required in dask (specification of names
and types of columns in the output DataFrame).
compute() - run the processing tree, prepared by the previous instructions.
Now the result is "normal" Pandas DataFrame.
sort_index() and reset_index() - Pandas operations on the result of compute().
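The same per-group max logic can be checked in plain pandas (a sketch using `groupby(...).idxmax()` instead of dask's `apply(getMax, ...)`; the frames are made up to resemble the test files below):

```python
import pandas as pd

# Three small frames standing in for the .csv files, with a 'path' column
# like the one dask's include_path_column=True would add.
frames = [
    pd.DataFrame({'IndexNo': [1001, 1002], 'EArray': [20, 20], 'path': 'T1'}),
    pd.DataFrame({'IndexNo': [1001, 1002], 'EArray': [22, 23], 'path': 'T2'}),
    pd.DataFrame({'IndexNo': [1001, 1002], 'EArray': [35, 34], 'path': 'T3'}),
]
df = pd.concat(frames, ignore_index=True)

# For each IndexNo, keep the row holding the maximum EArray.
result = df.loc[df.groupby('IndexNo')['EArray'].idxmax()].reset_index(drop=True)
print(result)
```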
For the test I prepared 3 .csv files, with 10 rows each:
T1.csv:
IndexNo date EArray
0 1001 2019-01-01 20
1 1002 2019-01-02 20
2 1003 2019-01-03 20
3 1004 2019-01-04 20
4 1005 2019-01-05 20
5 1006 2019-01-06 20
6 1007 2019-01-07 20
7 1008 2019-01-08 20
8 1009 2019-01-09 20
9 1010 2019-01-10 20
T2.csv:
IndexNo date EArray
0 1001 2019-01-11 22
1 1002 2019-01-12 23
2 1003 2019-01-13 24
3 1004 2019-01-14 25
4 1005 2019-01-15 26
5 1006 2019-01-16 27
6 1007 2019-01-17 28
7 1008 2019-01-18 29
8 1009 2019-01-19 30
9 1010 2019-01-20 31
T3.csv:
IndexNo date EArray
0 1001 2019-01-21 35
1 1002 2019-01-22 34
2 1003 2019-01-23 33
3 1004 2019-01-24 32
4 1005 2019-01-25 31
5 1006 2019-01-26 30
6 1007 2019-01-27 29
7 1008 2019-01-28 28
8 1009 2019-01-29 28
9 1010 2019-01-30 26
The result of my program is:
IndexNo EArray path
0 1001 35 T3
1 1002 34 T3
2 1003 33 T3
3 1004 32 T3
4 1005 31 T3
5 1006 30 T3
6 1007 29 T3
7 1008 29 T2
8 1009 30 T2
9 1010 31 T2
E.g. for IndexNo == 1001 the values of EArray are:
20, 22 and 35, one for each input file.
The result for IndexNo == 1001 contains:
EArray == 35 - the max value from the 3 above,
T3 - the source file containing the "max" row.
I'm aware that you will have to learn dask, but in my opinion
it is worth putting in some effort to do so.
Note that my code is quite clear and concise:
just 7 lines in functions and 4 lines in the main program.
