How to scrape zip files into a single dataframe in Python

I am very new to web scraping and I am trying to understand how I can scrape all the zip files and regular files that are on this website. The end goal is to scrape all the data; I was originally thinking I could use pd.read_html and feed in a list of each link and loop through each zip file.
Any help at all would be very useful. I have tried a few examples so far; please see the code below:
import pandas as pd
pd.read_html("https://www.omie.es/en/file-access-list?parents%5B0%5D=/&parents%5B1%5D=Day-ahead%20Market&parents%5B2%5D=1.%20Prices&dir=%20Day-ahead%20market%20hourly%20prices%20in%20Spain&realdir=marginalpdbc",match="marginalpdbc_2017.zip")
This is what I would like the output to look like, except each zip file would need to be its own data frame to work with and loop through. Currently, all it seems to be doing is downloading the names of the zip files, not the actual data.
Thank you

To open a zip file and read the files inside it into a dataframe, you can use the following example:
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
zip_url = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_2017.zip"
dfs = []
# read the zip in memory and parse every file inside it
with ZipFile(BytesIO(requests.get(zip_url).content)) as zf:
    for file in zf.namelist():
        df = pd.read_csv(
            zf.open(file),
            sep=";",
            skiprows=1,      # skip the header line of each file
            skipfooter=1,    # skip the trailing footer line
            engine="python",
            header=None,
        )
        dfs.append(df)
final_df = pd.concat(dfs)
# print first 10 rows:
print(final_df.head(10).to_markdown(index=False))
Prints:
|    0 |   1 |   2 |   3 |     4 |     5 |   6 |
|-----:|----:|----:|----:|------:|------:|----:|
| 2017 |   1 |   1 |   1 | 58.82 | 58.82 | nan |
| 2017 |   1 |   1 |   2 | 58.23 | 58.23 | nan |
| 2017 |   1 |   1 |   3 | 51.95 | 51.95 | nan |
| 2017 |   1 |   1 |   4 | 47.27 | 47.27 | nan |
| 2017 |   1 |   1 |   5 | 46.9  | 45.49 | nan |
| 2017 |   1 |   1 |   6 | 46.6  | 44.5  | nan |
| 2017 |   1 |   1 |   7 | 46.25 | 44.5  | nan |
| 2017 |   1 |   1 |   8 | 46.1  | 44.72 | nan |
| 2017 |   1 |   1 |   9 | 46.1  | 44.22 | nan |
| 2017 |   1 |   1 |  10 | 45.13 | 45.13 | nan |
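The example above handles a single zip. To get every yearly zip as its own DataFrame, one option is to reuse the same download URL pattern and loop over the years. A sketch, where the helper name zip_to_df and the year range are assumptions (adjust to what the listing page actually offers):
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile

def zip_to_df(zip_url):
    # download one zip and concat every file inside it into a single DataFrame
    dfs = []
    with ZipFile(BytesIO(requests.get(zip_url).content)) as zf:
        for file in zf.namelist():
            dfs.append(pd.read_csv(zf.open(file), sep=";", skiprows=1,
                                   skipfooter=1, engine="python", header=None))
    return pd.concat(dfs)

base = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_{}.zip"
# one DataFrame per zip, keyed by year (the year range is an assumption)
frames = {year: zip_to_df(base.format(year)) for year in range(2017, 2021)}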

Related

Pandas read_excel wrong output

My goal is simply to execute the pandas read_excel function. However, I'm running into a very strange situation in which I am trying to run pandas.read_excel on two very similar Excel files but am getting substantially different results.
Update: I've been experimenting with the two files. I copied the data from the DOB column of the 2nd file into the 1st file to make the files look visually identical. However, I noticed some really interesting behaviour when using Ctrl+F in Microsoft Excel: when I leave the search box blank in the first file, it finds no matches, but when I repeat the same operation with the 2nd file, it finds 21 matches, one for each cell between E1 and G7. I suppose there are somehow some blank/invisible cells in the 2nd file, and that's what's causing read_excel to behave differently.
Code
import pandas
data1 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_1.xlsx')
print(data1)
data2 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_2.xlsx')
print(data2)
Output
        Name        DOB Class Year  GPA
0  Redeye438 2008-09-22      Fresh    1
1  Redeye439 2009-09-20       Soph    2
2  Redeye440 2010-09-22     Junior    3
3  Redeye441 2011-09-20     Senior    4
4  Redeye442 2008-09-20      Fresh    4
5  Redeye443 2009-09-22       Soph    3
                             Name  DOB Class Year  GPA
Redeye438 2011-09-20 Fresh      1  NaN        NaN  NaN
Redeye439 2010-09-22 Soph       2  NaN        NaN  NaN
Redeye440 2009-09-20 Junior     3  NaN        NaN  NaN
Redeye441 2008-09-22 Senior     4  NaN        NaN  NaN
Redeye442 2011-09-22 Fresh      4  NaN        NaN  NaN
Redeye443 2010-09-20 Soph       3  NaN        NaN  NaN
Why are the columns mapped incorrectly for data2?
The excel files in question (the only difference is the data in the DOB column):
Excel file downloads
It looks like you're using an older version of Pandas because I can't reproduce the issue on the latest version.
import pandas as pd
pd.show_versions()
INSTALLED VERSIONS
------------------
python : 3.10.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
pandas : 1.5.0
As you can see below, the columns of data2 are mapped correctly.
import pandas as pd
data1 = pd.read_excel(r"C:\Users\abokey\Downloads\test_1.xlsx")
print(data1)
data2 = pd.read_excel(r"C:\Users\abokey\Downloads\test_2.xlsx")
print(data2)
        Name        DOB Class Year  GPA
0  Redeye438 2008-09-22      Fresh    1
1  Redeye439 2009-09-20       Soph    2
2  Redeye440 2010-09-22     Junior    3
3  Redeye441 2011-09-20     Senior    4
4  Redeye442 2008-09-20      Fresh    4
5  Redeye443 2009-09-22       Soph    3
        Name        DOB Class Year  GPA
0  Redeye438 2011-09-20      Fresh    1
1  Redeye439 2010-09-22       Soph    2
2  Redeye440 2009-09-20     Junior    3
3  Redeye441 2008-09-22     Senior    4
4  Redeye442 2011-09-22      Fresh    4
5  Redeye443 2010-09-20       Soph    3
However, you're right that the two Excel files do not have the same format. In fact, looking at test_2.xlsx more carefully, it seems to carry a 7x3 block of blank cells (rows/cols). The latest version of pandas handles this kind of Excel file, ignoring the empty cells when calling pandas.read_excel.
So in principle, upgrading your pandas version should fix the problem:
pip install --upgrade pandas
If the problem persists, clean your Excel file test_2.xlsx as follows:
Open up the file in Excel
Click on Find & Select
Choose Go To Special and Blanks
Click on Clear All
Save your changes
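If upgrading isn't an option, a workaround is to read the file without header inference and drop the all-empty columns yourself. A sketch, assuming the stray blank cells are the only problem:
import pandas as pd

# read without header inference, drop columns that are entirely empty,
# then promote the first row to the header
raw = pd.read_excel(r"C:\Users\abokey\Downloads\test_2.xlsx", header=None)
raw = raw.dropna(axis=1, how="all")
raw.columns = raw.iloc[0]
data2 = raw.drop(index=0).reset_index(drop=True)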

Data Extraction from multiple excel files in pandas dataframe

I'm trying to create a data-ingestion routine to load data from multiple Excel files, each with multiple tabs and columns, into pandas data frames. The structure of the tabs in each Excel file is the same. Each tab of an Excel file should become a separate data frame. As of now, I have created a list of data frames, one per Excel file, each holding the concatenated data from all of that file's tabs. But I'm trying to find a way to access each Excel file from a data structure, and each tab of that file, as a separate data frame. Below is the current code. Any improvement would be appreciated! Please let me know if anything else is needed.
import os
import pandas as pd

# Assigning the path to the folder variable
folder = 'specified_path'
# Getting the list of files from the assigned path
excel_files = [file for file in os.listdir(folder)]
list_of_dfs = []
for file in excel_files:
    df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None), ignore_index=True)
    df['excelfile_name'] = file.split('.')[0]
    list_of_dfs.append(df)
I would propose to change the line
df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None), ignore_index=True)
to
df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None))
df.index = df.index.get_level_values(0)
df = df.reset_index().rename({'index': 'Tab'}, axis=1)
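Put together with the loop from the question, the full routine might look like the following sketch (specified_path is the question's placeholder):
import os
import pandas as pd

folder = 'specified_path'  # placeholder path from the question
list_of_dfs = []
for file in os.listdir(folder):
    # sheet_name=None reads every tab; concat keeps the tab name in the index
    df = pd.concat(pd.read_excel(os.path.join(folder, file), sheet_name=None))
    df.index = df.index.get_level_values(0)
    df = df.reset_index().rename({'index': 'Tab'}, axis=1)
    df['excelfile_name'] = os.path.splitext(file)[0]
    list_of_dfs.append(df)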
To create a separate dataframe for each tab (with duplicated content) in an Excel file, one could iterate over index level 0 values and index with it:
df = pd.concat(pd.read_excel(filename, sheet_name=None))
list_of_dfs = []
for tab in df.index.get_level_values(0).unique():
    tab_df = df.loc[tab]
    list_of_dfs.append(tab_df)
For illustration, after running the above code on an Excel file with 3 tabs, here is the content of list_of_dfs:
[ Date Reviewed Adjusted
0 2022-07-11 43 20
1 2022-07-18 16 8
2 2022-07-25 8 3
3 2022-08-01 17 3
4 2022-08-15 14 6
5 2022-08-22 12 5
6 2022-08-29 8 4,
Date Reviewed Adjusted
0 2022-07-11 43 20
1 2022-07-18 16 8
2 2022-07-25 8 3
3 2022-08-01 17 3
4 2022-08-15 14 6
5 2022-08-22 12 5
6 2022-08-29 8 4,
Date Reviewed Adjusted
0 2022-07-11 43 20
1 2022-07-18 16 8
2 2022-07-25 8 3
3 2022-08-01 17 3
4 2022-08-15 14 6
5 2022-08-22 12 5
6 2022-08-29 8 4]
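Alternatively, since read_excel with sheet_name=None already returns a dict of DataFrames keyed by tab name, you can keep each tab separate without the concat-and-split round trip. A sketch, where the workbook and tab names in the access example are hypothetical:
import os
import pandas as pd

folder = 'specified_path'  # placeholder from the question
# sheets_by_file[workbook_name][tab_name] -> one DataFrame per tab
sheets_by_file = {
    os.path.splitext(file)[0]: pd.read_excel(os.path.join(folder, file), sheet_name=None)
    for file in os.listdir(folder)
}
# e.g. one tab of one workbook (both names are hypothetical):
# df = sheets_by_file['report_2022']['Sheet1']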

Pivot table with Multi Index Dataframe

I am struggling with how to pivot a dataframe with multi-indexed columns. First I import the data from an .xlsx file, and from there I've tried to generate a certain dataframe.
Note: I'm not allowed to embed images, so that's the reason for the links.
import pandas as pd
import numpy as np
# Read Excel file
df = pd.read_excel("myFile.xlsx", header=[0])
Output: Click
If you want, here you can see the File: Link to File
# Get Month and Year of the Dataframe
month_year = df.iloc[:, 5:-1].columns
month_list = []
year_list = []
for x in range(len(month_year)-1):
    if "Unnamed" not in month_year[x]:
        month_list.append(month_year[x].split()[0])
        year_list.append(month_year[x].split()[1])
# Read Excel file with headers 1 & 2
df = pd.read_excel("myFile.xlsx", header=[0, 1])
Output: Click
# Join both indexes excluding the ones with Unnamed
df.columns = [str(x[0] + " " + x[1]) if("Unnamed" not in x[1]) else str(x[0]) for x in df.columns ]
Output: Click
# Adding month and list columns to the DataFrame
df['Month'] = month_list
df['Year'] = year_list
Output: Click
I want the output DataFrame to be like the following:
Desire Output
You should clean it up a bit, because I do not know how the Total column should be handled.
The code below reads the Excel file with a MultiIndex header, modifies the column names a bit, then stacks and extracts the Year and Month columns.
df = pd.read_excel("Downloads/myFile.xlsx", header=[0,1], index_col=[0, 1, 2])
df.index.names = ['Project', 'KPI', 'Metric']
df.columns = df.columns.delete(-1).union([('Total', 'Total')])
df.columns.names = ['Month_Year', 'Values']
(df
 .stack(level=0)
 .rename_axis(columns=None)
 .reset_index()
 .assign(Year=lambda df: df.Month_Year.str.split(" ").str[-1],
         Month=lambda df: df.Month_Year.str.split(" ").str[0])
 .drop(columns='Month_Year')
)
Project KPI Metric Real Target Total Year Month
0 Project 1 Numeric Project 1 Metric 10.0 30.0 NaN 2019 April
1 Project 1 Numeric Project 1 Metric 651.0 51651.0 NaN 2019 February
2 Project 1 Numeric Project 1 Metric 200.0 215.0 NaN 2019 January
3 Project 1 Numeric Project 1 Metric 2.0 5.0 NaN 2019 March
4 Project 1 Numeric Project 1 Metric NaN NaN 9.0 Total Total
5 Project 2 General Project 2 Metric 20.0 10.0 NaN 2019 April
6 Project 2 General Project 2 Metric 500.0 100.0 NaN 2019 February
7 Project 2 General Project 2 Metric 749.0 12.0 NaN 2019 January
8 Project 2 General Project 2 Metric 1.0 7.0 NaN 2019 March
9 Project 2 General Project 2 Metric NaN NaN 7.0 Total Total
10 Project 3 Numeric Project 3 Metric 30.0 20.0 NaN 2019 April
11 Project 3 Numeric Project 3 Metric 200.0 55.0 NaN 2019 February
12 Project 3 Numeric Project 3 Metric 5583.0 36.0 NaN 2019 January
13 Project 3 Numeric Project 3 Metric 3.0 7.0 NaN 2019 March
14 Project 3 Numeric Project 3 Metric NaN NaN 4.0 Total Total
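As for the Total column, one option is to split the Total rows out after the reshape. A sketch, assuming the chained expression above was assigned to result:
# `result` is assumed to be the reshaped frame from the chain above
totals = result[result["Year"].eq("Total")].drop(columns=["Real", "Target", "Year", "Month"])
monthly = result[result["Year"].ne("Total")].drop(columns="Total")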

Getting public Google sheets as panda dataframe

I have been trying to obtain the Statewise sheet from this public Google Sheets link as a Python dataframe.
The URL of this sheet is different from the URLs in other examples of getting a sheet as a dataframe seen on this website.
The URL is this:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSc_2y5N0I67wDU38DjDh35IZSIS30rQf7_NYZhtYYGU1jJYT6_kDx4YpF-qw0LSlGsBYP8pqM_a1Pd/pubhtml#
One standard way may be the following:
import pandas
googleSheetId = '<Google Sheets Id>'
worksheetName = '<Sheet Name>'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
    googleSheetId,
    worksheetName
)
df = pandas.read_csv(URL)
print(df)
But the present URL does not follow the structure used here. Can someone help clarify? Thanks.
The Google spreadsheet is actually published as an HTML page, so you should use read_html to load it into a list of pandas dataframes:
dfs = pd.read_html(url, encoding='utf8')
if lxml is available, or, if you use BeautifulSoup4:
dfs = pd.read_html(url, flavor='bs4', encoding='utf8')
You will get a list of dataframes and for example dfs[0] is:
0 1 2 3
0 1 id Banner Number_Of_Times
1 2 1 Don't Hoard groceries and essentials. Please e... 2
2 3 2 Be compassionate! Help those in need like the ... 2
3 4 3 Be considerate : While buying essentials remem... 2
4 5 4 Going out to buy essentials? Social Distancing... 2
5 6 5 Plan ahead! Take a minute and check how much y... 2
6 7 6 Plan and calculate your essential needs for th... 2
7 8 7 Help out the elderly by bringing them their gr... 2
8 9 8 Help out your workers and domestic help by not... 2
9 10 9 Lockdown means LOCKDOWN! Avoid going out unles... 1
10 11 10 Panic mode : OFF! ❌ESSENTIALS ARE ON! ✔️ 1
11 12 11 Do not panic! ❌ Your essential needs will be t... 1
12 13 12 Be a true Indian. Show compassion. Be consider... 1
13 14 13 If you have symptoms and suspect you have coro... 1
14 15 14 Stand Against FAKE News and WhatsApp Forwards!... 1
15 16 15 If you have any queries, Reach out to your dis... 1
You can use the following snippet:
from io import BytesIO
import pandas as pd
import requests

r = requests.get(URL)  # URL is the sheet link from the question
data = r.content
# note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions use on_bad_lines="skip" instead
df = pd.read_csv(BytesIO(data), index_col=0, error_bad_lines=False)
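Alternatively, sheets published to the web usually also expose a CSV endpoint: replacing pubhtml with pub?gid=<gid>&single=true&output=csv lets you call read_csv directly. A sketch, where gid=0 (the first tab) is an assumption; check the tab's actual gid:
import pandas as pd

base = ("https://docs.google.com/spreadsheets/d/e/"
        "2PACX-1vSc_2y5N0I67wDU38DjDh35IZSIS30rQf7_NYZhtYYGU1jJYT6_kDx4YpF-qw0LSlGsBYP8pqM_a1Pd")
# gid identifies the tab; 0 is usually the first sheet (an assumption)
df = pd.read_csv(base + "/pub?gid=0&single=true&output=csv")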

Writing tabular data to a csv file from a webpage

I've written a script in Python to parse some data from a webpage and write it to a csv file via pandas. So far, what I've written can parse all the tables available on that page, but it only writes the last table to the csv file. The data are being overwritten because of the loop. How can I fix this flaw so that my scraper writes the data from all the tables instead of only the last one? Thanks in advance.
import csv
import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res, "lxml")
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    df.to_csv("table_item.csv")
    print(df)
By the way, I expect to write the data to a csv file using pandas only. Thanks again.
You can use read_html, which returns a list of DataFrames from the webpage, and then concat them into one df:
dfs = pd.read_html('http://www.espn.com/nba/schedule/_/date/20171001')
df = pd.concat(dfs, ignore_index=True)
#if necessary rename columns
d = {'Unnamed: 1':'a', 'Unnamed: 7':'b'}
df = df.rename(columns=d)
print (df.head())
matchup a time (ET) nat tv away tv home tv \
0 Atlanta ATL Miami MIA NaN NaN NaN NaN
1 LA LAC Toronto TOR NaN NaN NaN NaN
2 Guangzhou Guangzhou Washington WSH NaN NaN NaN NaN
3 Charlotte CHA Boston BOS NaN NaN NaN NaN
4 Orlando ORL Memphis MEM NaN NaN NaN NaN
tickets b
0 2,401 tickets available from $6 NaN
1 284 tickets available from $29 NaN
2 2,792 tickets available from $2 NaN
3 2,908 tickets available from $6 NaN
4 1,508 tickets available from $3 NaN
And last to_csv for write to file:
df.to_csv("table_item.csv", index=False)
EDIT:
For learning, it is also possible to append each DataFrame to a list and then concat:
res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res, "lxml")
dfs = []
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
print(df)
df.to_csv("table_item.csv")
