Pandas read_excel wrong output - python

Update: I've been experimenting with the two files. I copied the data from the DOB column of the 2nd file into the 1st file to make them look visually identical. However, I noticed some really interesting behaviour when using Ctrl+F in Microsoft Excel: when I leave the search box blank in the first file, it finds no matches, but when I repeat the same operation in the 2nd file it finds 21 matches, one for each cell from E1 to G7. I suppose there are somehow some blank/invisible cells in the 2nd file, and that's what's causing read_excel to behave differently.
My goal is simply to execute the pandas read_excel function. However, I'm running into a very strange situation in which running pandas.read_excel on two very similar Excel files gives substantially different results.
Code
import pandas
data1 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_1.xlsx')
print(data1)
data2 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_2.xlsx')
print(data2)
Output
Name DOB Class Year GPA
0 Redeye438 2008-09-22 Fresh 1
1 Redeye439 2009-09-20 Soph 2
2 Redeye440 2010-09-22 Junior 3
3 Redeye441 2011-09-20 Senior 4
4 Redeye442 2008-09-20 Fresh 4
5 Redeye443 2009-09-22 Soph 3
Name DOB Class Year GPA
Redeye438 2011-09-20 Fresh 1 NaN NaN NaN
Redeye439 2010-09-22 Soph 2 NaN NaN NaN
Redeye440 2009-09-20 Junior 3 NaN NaN NaN
Redeye441 2008-09-22 Senior 4 NaN NaN NaN
Redeye442 2011-09-22 Fresh 4 NaN NaN NaN
Redeye443 2010-09-20 Soph 3 NaN NaN NaN
Why are the columns mapped incorrectly for data2?
The Excel files in question (the only difference is the data in the DOB column):
Excel file downloads

It looks like you're using an older version of Pandas because I can't reproduce the issue on the latest version.
import pandas as pd
pd.show_versions()
INSTALLED VERSIONS
------------------
python : 3.10.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
pandas : 1.5.0
As you can see below, the columns of data2 are mapped correctly.
import pandas as pd
data1 = pd.read_excel(r"C:\Users\abokey\Downloads\test_1.xlsx")
print(data1)
data2 = pd.read_excel(r"C:\Users\abokey\Downloads\test_2.xlsx")
print(data2)
Name DOB Class Year GPA
0 Redeye438 2008-09-22 Fresh 1
1 Redeye439 2009-09-20 Soph 2
2 Redeye440 2010-09-22 Junior 3
3 Redeye441 2011-09-20 Senior 4
4 Redeye442 2008-09-20 Fresh 4
5 Redeye443 2009-09-22 Soph 3
Name DOB Class Year GPA
0 Redeye438 2011-09-20 Fresh 1
1 Redeye439 2010-09-22 Soph 2
2 Redeye440 2009-09-20 Junior 3
3 Redeye441 2008-09-22 Senior 4
4 Redeye442 2011-09-22 Fresh 4
5 Redeye443 2010-09-20 Soph 3
However, you're right that the two Excel files are not in the same format. In fact, looking at test_2.xlsx more carefully, it appears to contain a 7x3 block of blank cells (7 rows by 3 columns). The latest version of pandas handles this kind of Excel file: the empty cells are ignored when calling pandas.read_excel.
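You can confirm this from Python as well. Here's a quick sketch using openpyxl (the engine pandas uses for .xlsx files) to inspect the extent of the sheet actually stored in the file; the path is the one from the code above:
from openpyxl import load_workbook

wb = load_workbook(r"C:\Users\abokey\Downloads\test_2.xlsx")
ws = wb.active
# If the reported dimensions extend past the visible data (e.g. out to G7),
# the file contains stored-but-empty cells.
print(ws.dimensions, ws.max_row, ws.max_column)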
So in principle, upgrading your pandas version should fix the problem:
pip install --upgrade pandas
If the problem persists, clean your Excel file test_2.xlsx like this:
Open up the file in Excel
Click on Find & Select
Choose Go To Special and Blanks
Click on Clear All
Save your changes
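If you'd rather not edit the file by hand, a programmatic workaround is to drop the all-empty rows and columns right after reading. This is only a sketch: depending on how your pandas version mis-parsed the file, the headers may still be shifted, in which case upgrading really is the fix.
import pandas as pd

data2 = pd.read_excel(r"C:\Users\abokey\Downloads\test_2.xlsx")
# Drop columns, then rows, that contain nothing but NaN
data2 = data2.dropna(axis=1, how="all").dropna(axis=0, how="all")
print(data2)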

why is pd.crosstab not giving the expected output in python pandas?

I have two dataframes, which I am calling df1 and df2.
df1 has columns KPI and Context, and it looks like this:
KPI Context
0 Does the company have a policy in place to man... Anti-Bribery Policy\nBroadridge does not toler...
1 Does the company have a supplier code of conduct? Vendor Code of Conduct Our vendors play an imp...
2 Does the company have a grievance/complaint ha... If you ever have a question or wish to report ...
3 Does the company have a human rights policy ? Human Rights Statement of Commitment Broadridg...
4 Does the company have a policies consistent wi... Anti-Bribery Policy\nBroadridge does not toler...
df2 has a single column, 'Keyword'.
df2:
Keyword
0 1.5 degree
1 1.5°
2 2 degree
3 2°
4 accident
I wanted to create another dataframe out of these two dataframes, where for each value in the 'Keyword' column of df2 that is present in the 'Context' column of df1, I simply write the count of it.
For this I have used pd.crosstab(), but I suspect it's not giving me the expected output.
Here's what I have tried so far.
new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())
the new_df looks like this.
KPI 1.5 degree 1.5° \
0 Does the Supplier code of conduct cover one or... NaN NaN
1 Does the companies have sites/operations locat... NaN NaN
2 Does the company have a due diligence process ... NaN NaN
3 Does the company have a grievance/complaint ha... NaN NaN
4 Does the company have a grievance/complaint ha... NaN NaN
2 degree 2° accident
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 1.0 NaN NaN
4 NaN NaN NaN
The expected output which I want is something like this.
0 KPI 1.5 degree 1.5° 2 degree 2° accident
1 Does the company have a policy in place to man 44 2 3 5 9
What exactly am I missing? Please let me know, thanks!
There are multiple problems. First, explode works with lists of split values, not with plain strings. To extract each Keyword from Context you need Series.str.findall, and crosstab should use columns from the same DataFrame, not two different ones:
import re
import pandas as pd

# Build a regex that matches any keyword as a whole word
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
# Find every keyword occurrence in each Context (case-insensitive)
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
# One row per matched keyword, then count matches per KPI
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])
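One caveat: because the search uses flags=re.I, the same keyword can show up under different capitalizations ('Accident' vs 'accident') and be counted as separate columns. An optional normalization step, assuming the keywords in df2 are lowercase:
# Fold matched text to lowercase so case variants count together
new_df['new'] = new_df['new'].str.lower()
out = pd.crosstab(new_df['KPI'], new_df['new'])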

How to scrape zip files into a single dataframe in python

I am very new to web scraping and I am trying to understand how I can scrape all the zip files and regular files that are on this website. The end goal is to scrape all the data; I was originally thinking I could use pd.read_html and feed in a list of each link and loop through each zip file.
I have tried a few examples so far; please see the code below.
import pandas as pd
pd.read_html("https://www.omie.es/en/file-access-list?parents%5B0%5D=/&parents%5B1%5D=Day-ahead%20Market&parents%5B2%5D=1.%20Prices&dir=%20Day-ahead%20market%20hourly%20prices%20in%20Spain&realdir=marginalpdbc",match="marginalpdbc_2017.zip")
So this is what I would like the output to look like except each zip file would need to be its own data frame to work with/loop through. Currently, all it seems to be doing is downloading all the names of the zip files, not the actual data.
Thank you
To open a zip file and read the files inside it into a dataframe, you can use the following example:
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
zip_url = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_2017.zip"
dfs = []
with ZipFile(BytesIO(requests.get(zip_url).content)) as zf:
    for file in zf.namelist():
        df = pd.read_csv(
            zf.open(file),
            sep=";",
            skiprows=1,
            skipfooter=1,
            engine="python",
            header=None,
        )
        dfs.append(df)
final_df = pd.concat(dfs)
# print first 10 rows:
print(final_df.head(10).to_markdown(index=False))
Prints:
|    0 |   1 |   2 |   3 |     4 |     5 |   6 |
|-----:|----:|----:|----:|------:|------:|----:|
| 2017 |   1 |   1 |   1 | 58.82 | 58.82 | nan |
| 2017 |   1 |   1 |   2 | 58.23 | 58.23 | nan |
| 2017 |   1 |   1 |   3 | 51.95 | 51.95 | nan |
| 2017 |   1 |   1 |   4 | 47.27 | 47.27 | nan |
| 2017 |   1 |   1 |   5 | 46.9  | 45.49 | nan |
| 2017 |   1 |   1 |   6 | 46.6  | 44.5  | nan |
| 2017 |   1 |   1 |   7 | 46.25 | 44.5  | nan |
| 2017 |   1 |   1 |   8 | 46.1  | 44.72 | nan |
| 2017 |   1 |   1 |   9 | 46.1  | 44.22 | nan |
| 2017 |   1 |   1 |  10 | 45.13 | 45.13 | nan |
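To extend this to every archive on the page, you can loop over a list of download URLs in the same way. A sketch, assuming the file-download URL pattern above holds for the other years (verify the actual links on the file-access page first):
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile

base = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_{}.zip"
zip_urls = [base.format(year) for year in range(2017, 2021)]  # hypothetical year range

frames = {}
for url in zip_urls:
    dfs = []
    with ZipFile(BytesIO(requests.get(url).content)) as zf:
        for file in zf.namelist():
            dfs.append(pd.read_csv(zf.open(file), sep=";", skiprows=1,
                                   skipfooter=1, engine="python", header=None))
    frames[url] = pd.concat(dfs)  # one combined dataframe per zip file
Here frames maps each zip URL to its own dataframe; pd.concat(frames.values()) would merge them all into one.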

How to import tables from multiple pdfs into a single data frame using python?

I'm using the tabula package in python 3 to get data from tables in pdfs.
I am trying to import tables from multiple pdfs online (e.g. http://trreb.ca/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf), but I am having trouble even getting one table imported properly.
Here is the code that I have run:
! pip install -q tabula-py
! pip install pandas
import pandas as pd
import tabula
from tabula import read_pdf
pdf = "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf"
data = read_pdf(pdf, output_format='dataframe', pages="all")
data
which gives the following output:
[ Community Sales Dollar Volume ... Active Listings Avg. SP/LP Avg. DOM
0 Ajax 391 $265,999,351 ... 73 100% 21
1 Central East 32 $21,177,488 ... 3 99% 26
2 Northeast Ajax 70 $50,713,199 ... 18 100% 21
3 South East 105 $68,203,487 ... 15 100% 20
[4 rows x 9 columns]]
Which seems to work, except that it has missed every other row after "Central East". Here is the actual table in question, from the pdf at the url in the code above:
Ajax Q4 2019
I have also tried fiddling with some of the options in the read_pdf function, with minimal results.
The end goal will be a script that loops through all these "Community Reports" (there are quite a few), pulling all such tables from the pdfs, and consolidating them into one dataframe in python for analysis.
If the question isn't clear, or more info is needed, please let me know! I'm new to both python and stack exchange, so apologies if I'm not framing things correctly.
And of course any help would be greatly appreciated!
Bryn
The following code almost worked:
import pandas as pd
from tabula import convert_into

pdf = "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf"
convert_into(pdf, "test.csv", pages="all", lattice=True)
with open("test.csv", 'r') as f:
    with open("updated_test.csv", 'w') as f1:
        next(f)  # skip header line
        for line in f:
            f1.write(line)
data = pd.read_csv("updated_test.csv")
# rename first column, drop unwanted rows
data.rename(columns={'Unnamed: 0': 'Community'}, inplace=True)
data.dropna(inplace=True)
data
and gives output:
Community Year Quarter Sales Dollar Volume Average Price Median Price New Listings Active Listings Avg. SP/LP
1 Central 2019 Q4 44.0 $27,618,950 $627,703 $630,500 67.0 8.0 99%
2 Central East 2019 Q4 32.0 $21,177,488 $661,797 $627,450 34.0 3.0 99%
3 Central West 2019 Q4 57.0 $40,742,450 $714,780 $675,000 65.0 7.0 99%
4 Northeast Ajax 2019 Q4 70.0 $50,713,199 $724,474 $716,500 82.0 18.0 100%
5 Northwest Ajax 2019 Q4 49.0 $37,192,790 $759,037 $765,000 63.0 14.0 99%
6 South East 2019 Q4 105.0 $68,203,487 $649,557 $640,000 117.0 15.0 100%
7 South West 2019 Q4 34.0 $20,350,987 $598,558 $590,000 36.0 8.0 99%
The only issue here is that the last column, "Avg. DOM", wasn't picked up by the convert_into command.
For my analysis this doesn't matter, but it could definitely be an issue for others trying to pull tables in a similar manner.
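For the looping part of the end goal, a minimal sketch; the list of report URLs here is hypothetical, so build it from however you enumerate the Community Reports:
import pandas as pd
from tabula import read_pdf

# Hypothetical list of community-report URLs; replace with the real ones.
pdf_urls = [
    "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf",
]

frames = []
for url in pdf_urls:
    # read_pdf returns a list of dataframes, one per table it detects
    for table in read_pdf(url, pages="all", lattice=True):
        frames.append(table)

combined = pd.concat(frames, ignore_index=True)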

Write value in next available cell csv

I have code that writes people's names, ages and scores for a quiz that I made. I simplified the code to write the names and ages together rather than separately, but I can't write the score with the names because they are handled in separate parts of the code. The CSV file looks like this:
name, age, score
Alfie, 15, 20
Michael, 16, 19
Alfie, 15, #After I simplified
Dylan, 16,
As you can see, I don't know how to write a value in the third column. Does anyone know how to write a value into the next available cell of the third column in a CSV file? I'm new to programming so any help would be greatly appreciated.
Michael
This is your data:
df = pd.DataFrame({'name':['Alfie','Michael','Alfie','Dylan'], 'age':[15,16,15,16], 'score':[20,19,None,None]})
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 nan
3 Dylan 16 nan
If you need to read the CSV into pandas first, use:
import pandas as pd
df = pd.read_csv('Your_file_name.csv')
I suggest two ways to solve your problem:
df.fillna(0, inplace=True) fills every missing value (with 0 in this example).
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 0.0
3 Dylan 16 0.0
df.loc[2,'score'] = 22 fills a specific cell.
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 22.0
3 Dylan 16 nan
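And to write into the next available cell automatically rather than by hand, a small sketch that finds the first empty score:
# Locate the first row where 'score' is missing and fill it
mask = df['score'].isna()
if mask.any():
    df.loc[mask.idxmax(), 'score'] = 21  # 21 is just an example value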
If, after that, you need to write your fixed data back to a CSV file, use:
df.to_csv('New_name.csv', index=False)  # index=False leaves the row numbers out of the file

How to identify non-empty columns in a pandas dataframe?

I'm processing a dataset with around 2000 columns and I noticed many of them are empty. I want to know specifically how many of them are empty and how many are not. I use the following code:
df.isnull().sum()
This gives the number of empty rows in each column. However, given that I'm investigating about 2000 columns with 7593 rows, the output in IPython looks like the following:
FEMALE_DEATH_YR3_RT 7593
FEMALE_COMP_ORIG_YR3_RT 7593
PELL_COMP_4YR_TRANS_YR3_RT 7593
PELL_COMP_2YR_TRANS_YR3_RT 7593
...
FIRSTGEN_YR4_N 7593
NOT1STGEN_YR4_N 7593
It doesn't show all the columns because there are too many of them, which makes it very difficult to tell how many columns are entirely empty and how many are not.
Is there any way to identify the non-empty columns quickly? Thanks!
To find the number of completely empty columns (subtract this from the total to get the number of non-empty ones):
len(df.columns) - len(df.dropna(axis=1, how='all').columns)
3
df
Country movie name rating year Something
0 thg John 3 NaN NaN NaN
1 thg Jan 4 NaN NaN NaN
2 mol Graham lob NaN NaN NaN
df = df.dropna(axis=1, how='all')
Country movie name
0 thg John 3
1 thg Jan 4
2 mol Graham lob
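To list the columns themselves rather than just count them, a couple of one-liners (assuming "empty" means every value is NaN):
empty_cols = df.columns[df.isna().all()]       # columns with no values at all
non_empty_cols = df.columns[df.notna().any()]  # columns with at least one value
print(len(empty_cols), len(non_empty_cols))
print(list(non_empty_cols))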
