Dataframe is not aligned properly - python

I'm getting data from a REST API, converting it to JSON and then into a dataframe. I then write that dataframe to a CSV file.
The problem is that while it recognizes the column tags correctly, it shifts them one to the right because a 0 shows up at the very left.
I know it's the row counter, but how do I stop it from being written, OR how would I go about creating one additional column with the "counter" tag?
import json

import pandas as pd
import requests

response_dividends = requests.get(
    f"https://sandbox.iexapis.com/stable/stock/aapl/dividends/quote?token={iex_api}")
response_dividends_parsed = json.loads(response_dividends.text)
df = pd.DataFrame(response_dividends_parsed)
df.to_csv("main_data.csv")
The result then looks like this:
,amount,currency,declaredDate,description,exDate,flag,frequency,paymentDate,recordDate,refid,symbol,id,key,subkey,updated
0,0.22,USD,2021-04-15,Sydhnrraas Oeir,2021-04-25,Cash,quarterly,2021-05-12,2021-04-27,2239859,AAPL,NDIDDSEIV,LAAP,2243550,1683800492545
The problem is that it's not correctly aligned.
I opened it in the CSV viewer plugin of PyCharm and it shows:
wrongly aligned (screenshot)

If you set index=False, the row index (the counter you are seeing) will not be written to your csv file.
df.to_csv("main_data.csv", index=False)
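If you'd rather keep the counter as a real column (the asker's alternative), index_label gives the index its own header so the columns line up. A minimal sketch; the column values below are made up for illustration:

```python
import io

import pandas as pd

df = pd.DataFrame({"amount": [0.22], "currency": ["USD"]})

# Option 1: drop the row counter entirely
no_index = io.StringIO()
df.to_csv(no_index, index=False)

# Option 2: keep the counter, but give it a proper header
named_index = io.StringIO()
df.to_csv(named_index, index_label="counter")
```

With index_label, the header row becomes counter,amount,currency, so every value sits under a named column in any CSV viewer.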

Pandas - Change the default date format when creating a dataframe from CSVs

I have a script that loops through a folder of CSVs, reads them, removes any empty rows (they all have 'empty' rows that Pandas reads as NaN) and appends them to a master dataframe. It then writes the dataframe to a new CSV. This is all working as expected:
if pl.Path(file).suffix == '.csv':
    fullPath = os.path.join(sourceLoc, file)
    print(file)
    initDF = pd.read_csv(fullPath)
    cleanDF = initDF.dropna(subset=['Name'])
    masterDF = masterDF.append(cleanDF)
masterDF.to_csv(destLoc, index=False)
My only issue is the input dates are displayed like this 25/05/21 but the output dates end up formatted like this 05/25/21. As I'm in the UK and using a UK version of Excel to analyse the output, it's confusing all my functions.
The only solutions I've found so far are to reformat the date columns individually or style them, which to my understanding only affects how they look in Jupyter and not in the actual data. As there are multiple date columns in the source data files I'd rather not have to reformat them all individually.
Is there any way of defining the date format when first creating the dataframe, or reformatting every date column once the dataframe is filled?
In the end this issue was caused by two different problems.
The first was Excel intermittently exporting my dates in US format despite the original format (and my Windows Region settings) being UK format. I've now added a short VBA loop in my export code to ensure those columns are formatted correctly every time the data is exported.
The second was the CSV date being imported with incorrect dtypes. I suspect this was again the fault of Excel (2010 is problematic) but I'm unsure. I'm now correcting this with an astype() method.
The end result is my dates are now imported into Pandas in the correct format and outputted to a new CSV in the correct format too.
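The read-and-write halves can also be handled entirely in pandas: dayfirst=True makes read_csv interpret 25/05/21 as 25 May, and date_format in to_csv controls how the dates are written back out. A sketch with made-up column names:

```python
import io

import pandas as pd

raw = io.StringIO("Name,StartDate\nAlice,25/05/21\n")

# dayfirst=True parses 25/05/21 as 25 May 2021 instead of a US-style date
df = pd.read_csv(raw, parse_dates=["StartDate"], dayfirst=True)

out = io.StringIO()
# date_format controls how datetime columns are rendered in the output CSV
df.to_csv(out, index=False, date_format="%d/%m/%y")
```

This avoids reformatting each date column individually, as long as every date column is listed in parse_dates.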

Dataframe to CSV returns one empty column which is visible in the dataframe

After scraping I have put the information in a dataframe and want to export it to a .csv, but one of the three columns ("Content") comes out empty in the .csv file. This is weird since all three columns are visible in the dataframe, see screenshot.
Screenshot dataframe
Line I use to convert:
df.to_csv('filedestination.csv')
Inspecting the df returns objects:
Inspecting dataframe
Does anyone know how it is possible that the last column, "Content" does not show any data in the .csv file?
Screenshot .csv file
After suggestions it seems that the data is available when opening the file as .txt. How is it possible that Excel does not show the data properly?
Screenshot .txt file data
What is the data type of the Content column?
If it is not a string, you can convert it to a string and then perform df.to_csv.
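A minimal sketch of that suggestion; the values here are made up, and the real Content column may hold lists or other non-string objects from the scrape:

```python
import pandas as pd

df = pd.DataFrame({"Title": ["Page 1"], "Content": [["scraped", "tokens"]]})

# Force every value in the column to a plain string before exporting
df["Content"] = df["Content"].astype(str)
```

After the conversion, to_csv writes the string representation instead of whatever the object rendered as.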
Sometimes this happens weirdly: the dataframe you view and the file you export end up different. Try resetting the index before exporting to .csv/Excel. This always works for me.
df.reset_index()
then,
df.to_csv(r'file location/filename.csv')
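Since the data shows up in a text editor but not in Excel, one common culprit (an assumption here, not confirmed by the question) is the file encoding: Excel only auto-detects UTF-8 when the file starts with a byte-order mark, which the utf-8-sig codec adds:

```python
import pandas as pd

df = pd.DataFrame({"Title": ["Page 1"], "Content": ["Café déjà-vu"]})

# utf-8-sig prepends a BOM so Excel detects the encoding correctly
df.to_csv("filedestination.csv", index=False, encoding="utf-8-sig")
```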

Deleting stubborn \r in data frame and creating CSV

I am new in the field, and I am having problems getting rid of a mid-string \r in a pandas dataframe that I need to export to a CSV file.
Context: I had a CSV file that I downloaded as a report from the database platform we use in my organization. The report is legible to humans, not to computers, so there are all sorts of merged cells, page breaks, and lots of other formatting. I need to clean it to create a SQL database. One of the columns has an ID number that appears split across two lines when I view it in Excel:
This is how the original CSV looks when viewed in Excel.
I have tried to delete that separation, but I can't do it. When imported as a DataFrame, Python points out there is an "\r" mid-string - like below:
150043\r35
So this is what I have done:
I imported the CSV file:
df = pd.read_csv("Assessment.csv", header=None)
I attempted this:
df.replace("\r\n","", regex=True)
And this:
df.replace("\r","", regex=True)
After both attempts, it seemed that \r had disappeared in the data frame, like below:
15004335
However, when I create a new CSV, it keeps separating the lines:
This is how it looks even after using the replace function:
In the text editor, it looks like this:
,0,1,6,8,13,15,20,27
0,,,,,Student ID: ,150043
35,,
1,Student:,...
How do I get rid of this permanently? Am I missing something?
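One likely explanation (an assumption, since the posted snippets only show the expressions being evaluated): DataFrame.replace returns a new DataFrame rather than modifying the frame in place, so the result has to be assigned back before writing the CSV. A sketch with a made-up one-cell frame:

```python
import pandas as pd

df = pd.DataFrame({"id": ["150043\r35"]})

# replace() returns a new DataFrame; without this assignment the
# original df still contains the stray \r when to_csv runs
df = df.replace(r"\r\n|\r", "", regex=True)
```

Printing the return value of df.replace(...) makes the \r look gone, while the df handed to to_csv is still the original.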

Some Hyperlinks not opening with Openpyxl

I have a few hundred files with data and hyperlinks in them that I was trying to upload and append to a single DataFrame when I realized that Pandas was not reading any of the hyperlinks.
I then tried to use Openpyxl to read the hyperlinks in the input Excel files and write a new column into the excels with the text of the hyperlink that hopefully Pandas can read into my dataframe.
However, I am running into issues while testing the openpyxl code. It is able to read and write some of the hyperlinks but not others.
My sample file has three rows and looks like this:
My actual data has hyperlinks in the same way as the "Google" row in my test data set.
The other two hyperlinks in my test data I inserted by right-clicking on the cell and pasting the link.
Sample Test file here: Text.xlsx
Here is the code I wrote to read the hyperlink and paste it in a new column. It works for the first two rows (India and China) but fails for the third row (Google). It's unfortunate because all of my actual data is of that type. Can someone please help me figure it out?
import openpyxl

wb = openpyxl.load_workbook('test.xlsx')
ws = wb.active
column_indices = [1]
max_col = ws.max_column
ws.cell(row=1, column=max_col+1).value = "Hyperlink Text"
for row in range(2, ws.max_row+1):
    for col in column_indices:
        print(ws.cell(row, column=1).hyperlink.target)
        ws.cell(column=max_col+1, row=row).value = ws.cell(row, column=1).hyperlink.target
wb.save('test.xlsx')
The cells where you are using the HYPERLINK function (like google.com) will not be of type hyperlink, so .hyperlink is None for them. You will need to process those cell values yourself, using the re module or a similar approach.
The values look like this:
>>> ws.cell(2,1).value
'China'
>>> ws.cell(3,1).value
'India'
>>> ws.cell(4,1).value
'=HYPERLINK("www.google.com","google")'
Suggested code to handle HYPERLINK:
val = ws.cell(row=row, column=1).value
if isinstance(val, str) and val.startswith('=HYPERLINK'):
    hyplink = val  # or use the re module for a more robust extraction
Note: the second for loop, iterating over columns, seems unnecessary since you always use column=1.
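Putting both cases together: real hyperlink objects expose .hyperlink.target, while HYPERLINK formulas have to be parsed out of the cell text. The helper below is a sketch of that parsing step; the openpyxl usage in the comment assumes the question's single-column layout:

```python
import re

def formula_link(value):
    """Return the URL from an Excel HYPERLINK formula string, else None."""
    if isinstance(value, str):
        m = re.match(r'=HYPERLINK\("([^"]+)"', value)
        if m:
            return m.group(1)
    return None

# With openpyxl (sketch):
#   cell = ws.cell(row=row, column=1)
#   target = cell.hyperlink.target if cell.hyperlink else formula_link(cell.value)
```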

Delete entire column with specific content in python using pandas

I have imported an Excel sheet in Python using pandas and want to delete an entire column with specific content, as shown in the snapshot of the content.
From this image I want to delete any entire column containing NaN, which represents no data entered; later the remaining content will be used for computation with pandas, and graphs will be plotted with matplotlib.
Is there a way to delete an entire column based on its content rather than its label?
Try this:
s = pd.DataFrame({'1': [1, 2, 3, 4], '2': [np.nan, np.nan, np.nan, np.nan]})  # example DataFrame
s.dropna(axis=1, how='all')
It works fine.
Or try this:
for col in s.columns:
    if False not in list(np.isnan(s[col])):
        del s[col]
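Note that the first snippet only displays the cleaned frame: dropna returns a new DataFrame rather than modifying s in place. A runnable version that keeps the change:

```python
import numpy as np
import pandas as pd

s = pd.DataFrame({"1": [1, 2, 3, 4], "2": [np.nan] * 4})

# axis=1 drops columns; how='all' removes only columns that are entirely NaN
s = s.dropna(axis=1, how="all")
```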
