How to correct DataFrame header containing whitespaces.

I am new to Python and actually started with R.
My problem is that I am unable to debug KeyErrors from my pandas DataFrames. Here is part of the code:
I read in a data frame from Excel with the following commands.
import os
import pandas as pd
cwd = os.getcwd()
os.chdir(directorytofile)
os.listdir('.')
file = dataset
xl = pd.ExcelFile(file)
df1 = xl.parse('Sheet1')
Now when I want to select a header containing a blank space, like
Lieferung angelegt am
(it's German, sorry for that)
I get the KeyError. I tried different ways to remove blank spaces from my headers when building the DataFrame, like:
sep='\s*,\s*'
But it still occurs. Is there a way for me to see where the problem happens?
Obviously it's about the blank spaces, because headers without them work fine.
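One way to see where the problem happens is to print the headers with repr(), which makes leading, trailing, or doubled spaces visible, and then normalize them. The sketch below is only an illustration: the small frame is a hypothetical stand-in for the parsed sheet, and it assumes every header is a string. As far as I know, sep belongs to the delimited-text readers such as read_csv, so it would not be expected to change headers that come from an Excel sheet.

```python
import pandas as pd

# Hypothetical frame standing in for the parsed Excel sheet; the stray
# spaces mimic headers that look fine on screen but differ in memory.
df1 = pd.DataFrame({" Lieferung  angelegt am ": ["2021-01-01"], "Menge": [3]})

# repr() shows the exact header text, including hidden whitespace.
print([repr(c) for c in df1.columns])

# Strip surrounding whitespace and collapse internal runs to one space.
df1.columns = df1.columns.str.strip().str.replace(r"\s+", " ", regex=True)

# The lookup by the clean name now succeeds.
print(df1["Lieferung angelegt am"])
```

If the printed headers already look clean, the KeyError probably has another cause, for example a header row that sits further down in the sheet.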

Related

Picking out a specific column in a table

My goal is to import a table of astrophysical data that I have saved to my computer (obtained from matching 2 other tables in TOPCAT, if you know it), and extract certain relevant columns. I hope to then do further manipulations on these columns. I am a complete beginner in python, so I apologise for basic errors. I've done my best to try and solve my problem on my own but I'm a bit lost.
This script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File", in my drive. I've looked at this file in Notepad and it has a lot of descriptive bumf in the first few rows, so to try and get rid of this I've used "skiprows" as you can see. The data in the file is separated column-wise by lines--at least that's how it appears in Notepad.
The problem is that when I try to extract the first column using "usecols", it instead returns what appears to be the first row in the command window, as well as a load of vertical bars between each value. I assume it is somehow not interpreting the table correctly, not understanding what's a column and what's a row.
What I've tried: Modifying the file and saving it in a different filetype. This gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'location\\filename'
Despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of "pd.read_csv", but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (i.e. "usecols=[1,2]") I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: [1, 2]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.

Maybe you can try dataset.iloc[:,0]. With iloc you can extract columns or rows by integer position; [:,0] selects all rows of the first column.

The file is incorrectly named.
I expect that you are reading a csv file or an xlsx or txt file. So the (windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tells you this:
No such file or directory: 'location\\filename'
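Separately from the path issue, the vertical bars in the output suggest (this is an assumption, since the file itself is not shown) that the table is delimited by "|" rather than by commas. In that case read_csv with the default separator sees a single wide column, which would also explain the usecols error for [1, 2]. A minimal sketch with made-up data and column names:

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: two descriptive lines, then a
# pipe-delimited table. The column names are invented for illustration.
raw = io.StringIO(
    "# descriptive text\n"
    "# more descriptive text\n"
    "ra|dec|mag\n"
    "10.5|-3.2|18.1\n"
    "11.0|-3.0|17.4\n"
)

# sep="|" tells the parser where the columns are; skiprows drops the preamble.
dataset = pd.read_csv(raw, sep="|", skiprows=2, usecols=[0, 2])
print(dataset)
```

With the real file, skiprows would stay at whatever value skips the descriptive header, and the path would include the file extension.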

Dataframe is not aligned properly

I'm getting data from a REST API, converting it to JSON and then into a DataFrame. I then write that DataFrame to a CSV file.
The problem is that while it recognizes the column names correctly, it shifts them one position to the right because a 0 shows up at the very left.
I know it's the row count, but how do I stop it from being written, OR how would I go about creating one additional column with the "counter" label?
response_dividends = requests.get(
    f"https://sandbox.iexapis.com/stable/stock/aapl/dividends/quote?token={iex_api}")
response_dividends_parsed = json.loads(response_dividends.text)
df = pd.DataFrame(response_dividends_parsed)
df.to_csv("main_data.csv")
The result then looks like this:
,amount,currency,declaredDate,description,exDate,flag,frequency,paymentDate,recordDate,refid,symbol,id,key,subkey,updated
0,0.22,USD,2021-04-15,Sydhnrraas Oeir,2021-04-25,Cash,quarterly,2021-05-12,2021-04-27,2239859,AAPL,NDIDDSEIV,LAAP,2243550,1683800492545
The problem is that it's not correctly aligned.
I opened it in the CSV viewer plugin of PyCharm, and it shows the columns shifted against their headers.

If you set index=False, the row index (the 0, 1, 2, ... row labels) will not be written to your CSV file.
df.to_csv("main_data.csv", index=False)
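For the second part of the question, keeping the counter but giving it a header, to_csv also accepts an index_label argument. A small sketch with a made-up two-column frame, writing to in-memory buffers instead of main_data.csv so the result can be inspected directly:

```python
import io
import pandas as pd

# Made-up sample standing in for the parsed API response.
df = pd.DataFrame({"amount": [0.22], "currency": ["USD"]})

# Option 1: drop the row index entirely.
no_index = io.StringIO()
df.to_csv(no_index, index=False)

# Option 2: keep the row index and name its column "counter".
labelled = io.StringIO()
df.to_csv(labelled, index_label="counter")

print(no_index.getvalue())
print(labelled.getvalue())
```

Either way, the header row and the data rows end up with the same number of fields, which is what the viewer needs to align them.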

Deleting stubborn \r in data frame and creating CSV

I am new to the field, and I am having problems getting rid of a mid-string \r in a pandas data frame that I need to export into a CSV file.
Context: I had a CSV file that I downloaded as a report from the database platform we use in my organization. The report is legible to humans, not to computers, so there are all sorts of merging, page breaks, and lots of other formatting. I need to clean it to create a SQL database. One of the columns has an ID number that appears divided into two lines when I see it in Excel:
This is how the original CSV looks when viewed in Excel.
I have tried to delete that separation, but I can't do it. When imported as a DataFrame, Python points out there is an "\r" mid-string - like below:
150043\r35
So this is what I have done:
I imported the CSV file:
df = pd.read_csv("Assessment.csv", header=None)
I attempted this:
df.replace("\r\n","", regex=True)
And this:
df.replace("\r","", regex=True)
After both attempts, it seemed that \r had disappeared in the data frame, like below:
15004335
However, when I create a new CSV, it keeps separating the lines:
This is how it looks even after using the replace function:
In the text editor, it looks like this:
,0,1,6,8,13,15,20,27
0,,,,,Student ID: ,150043
35,,
1,Student:,...
How do I get rid of this permanently? Am I missing something?
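One possible explanation, offered as an assumption rather than a confirmed diagnosis: DataFrame.replace returns a new DataFrame by default and leaves the original untouched. The cleaned output shown after calling it is that returned copy, so unless the result is assigned back, to_csv still writes the old values. A minimal sketch with a one-cell frame:

```python
import pandas as pd

# One-cell stand-in for the imported report.
df = pd.DataFrame({0: ["150043\r35"]})

# replace() returns a cleaned copy; df itself is unchanged at this point.
cleaned = df.replace(r"[\r\n]+", "", regex=True)

print(repr(df.iloc[0, 0]))       # original still contains \r
print(repr(cleaned.iloc[0, 0]))  # cleaned copy does not

# Assign the result back (or write the cleaned copy) before exporting.
df = cleaned
```

After the assignment, df.to_csv(...) writes the ID on a single line.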

Some Hyperlinks not opening with Openpyxl

I have a few hundred files with data and hyperlinks in them that I was trying to upload and append to a single DataFrame when I realized that Pandas was not reading any of the hyperlinks.
I then tried to use Openpyxl to read the hyperlinks in the input Excel files and write a new column into the excels with the text of the hyperlink that hopefully Pandas can read into my dataframe.
However, I am running into issues when testing the openpyxl code. It is able to read and write some of the hyperlinks but not the others.
My sample file has three rows and looks like this:
My actual data has hyperlinks in the form that I used for "Google" in my test data set.
The other two hyperlinks in my test data I inserted by right-clicking on the cell and pasting the link.
Sample Test file here: Text.xlsx
Here is the code I wrote to read the hyperlink and paste it in a new column. It works for the first two rows (India and China) but fails for the third row (Google). It's unfortunate because all of my actual data is of that type. Can someone please help me figure it out?
import openpyxl
wb = openpyxl.load_workbook('test.xlsx')
ws = wb.active
column_indices = [1]
max_col = ws.max_column
ws.cell(row=1,column = max_col+1).value = "Hyperlink Text"
for row in range(2,ws.max_row+1):
    for col in column_indices:
        print(ws.cell(row, column=1).hyperlink.target)
        ws.cell(column=max_col+1,row=row).value = ws.cell(row, column=1).hyperlink.target
wb.save('test.xlsx')

The cells where you are using the HYPERLINK function (like google.com) will not be of type hyperlink. You will need to process the cells that use the HYPERLINK function with re or a similar approach.
The values look like this:
>>> ws.cell(2,1).value
'China'
>>> ws.cell(3,1).value
'India'
>>> ws.cell(4,1).value
'=HYPERLINK("www.google.com","google")'
Suggested code to handle HYPERLINK:
val = ws.cell(row, column).value
if isinstance(val, str) and val.startswith("=HYPERLINK"):
    hyplink = val  # formula text; use the re module to extract the URL
Note: The second for loop over columns does not seem to be required, since you always use column=1.
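As a rough sketch of the re-based approach, the helper below pulls the first quoted argument out of a HYPERLINK formula string. The function name extract_hyperlink_target is hypothetical, and the pattern assumes the URL is written as a literal string in the formula, as in the example value above, rather than as a cell reference.

```python
import re

# Matches =HYPERLINK("url", ...) and captures the url part.
HYPERLINK_RE = re.compile(r'=HYPERLINK\(\s*"([^"]*)"', re.IGNORECASE)

def extract_hyperlink_target(value):
    """Return the URL from a HYPERLINK formula string, or None."""
    if not isinstance(value, str):
        return None
    match = HYPERLINK_RE.match(value.strip())
    return match.group(1) if match else None

print(extract_hyperlink_target('=HYPERLINK("www.google.com","google")'))
print(extract_hyperlink_target("China"))
```

In the loop from the question, a cell could first be checked for a hyperlink attribute that is not None, and fall back to this helper on the cell value otherwise.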

pd.read_excel does recognize the file but does not actually read it

I've been busy working on some code and one part of it is importing an excel file. I've been using the code below. Now, on one pc it works but on another it does not (I did change the paths though). Python does recognize the excel file and does not give an error when loading, but when I print the table it says:
Empty DataFrame
Columns: []
Index: []
Just to be sure, I checked the filepath which seems to be correct. I also checked the sheetname but that is all good too.
df = pd.read_excel(book_filepath, sheet_name='Potentie_alles')
description = df["#"].map(str)
The second line then raises KeyError: '#' (# is the header of the first column of the sheet).
Does anyone know how to fix this?
Kind regards,
iCookieMonster
