Pandas "usecols" doesn't seems work perfectly

Pandas "usecols" doesn't seems work perfectly - python

I am reading below excel with below python code but not getting any idea why the first column header has ".1" even though set to ignore the first column. Any idea please? many thanks in advance.
python script
import pandas as pd
import os
os.system('cls')
df = pd.read_excel('test_1\Book1.xlsx','sheet1', header=0, skiprows=1, usecols='B:D',index_col= 0, nrows=5)
print(df)
I am very confused about ".1" in the first column name header "PIA IM Equity.1" in the below result

In the file you have two columns named "PIA IM Equity" when reading pandas will rename the second identical column by adding .1 in the name. If you would have a third column with the same name it would have added .2 in it.

Related

Pandas error: "None of [Index(['value'], dtype='object', name='Ticker')] are in the [index]"

Hi I am trying to create multiple csv files from a single big csv using python. The original csv file has multiple stocks data in 1 min date/time with Open, high, low, close, volume as other columns.
Sample data from original file is here
At first, I tried to copy individual Ticker and all its corresponding values to a new file with following code:
import pandas as pd
excel_file_path=r'C:\Users\mahan\Documents\test projects\01_07_APR_WEEKLY_expiry_data_VEGE_NF_AND_BNF_Options_Desktop_Vege.csv'
export_path=r"C:\Users\mahan\Documents\exportfiles\{output_file_name}_sheet.csv"
data= pd.read_csv(excel_file_path, index_col="Ticker") #Making data frame from csv file
rows= data.loc[['NIFTYWK17500CE']] #Retrieving rows by loc method
output_file_name ="NIFTYWK17500CE_"
print(type(rows))
rows
rows.to_csv(export_path)
Result was something like this:
a file was saved with the name "{output_file_name}__sheet.csv"
I failed at naming the file but data was copied pertaining to all the values with Ticker value 'NIFTYWK17500CE'.
Then I tried to create a array with column "Ticker" to find unique values. Created a dataframe with original file for all the data. And tried to use a For loop for values in the array matching the 1st column 'Ticker' and copy those data to a new file using the value in the exporting csv file name.
code as below:
import pandas as pd
excel_file_path=r'C:\Users\mahan\Documents\test projects\01_07_APR_WEEKLY_expiry_data_VEGE_NF_AND_BNF_Options_Desktop_Vege.csv'
df2=pd.read_csv(excel_file_path)
df2_uniques =df2['Ticker'].unique()
df2_counts=df2['Ticker'].value_counts()
for value in df2_uniques:
value=value.replace(' ', '_')
export_path=r"C:\Users\mahan\Documents\exportfiles\{value}__sheet.csv"
df=pd.read_csv(excel_file_path,index_col="Ticker")
rows=df.loc[['value']]
print(type(rows))
rows.to_csv(export_path)
Received an error:
KeyError: "None of [Index(['value'], dtype='object', name='Ticker')] are in the [index]"
Where did I went wrong:
In naming the file properly to save in earlier code.
In the second code.
Any help is really appreciated. Thanks in advance.
SOLVED
What worked for me was the following with comments:
import pandas as pd
excel_file_path=r'C:\Users\mahan\Documents\test projects\01_07_APR_WEEKLY_expiry_data_VEGE_NF_AND_BNF_Options_Desktop_Vege.csv'
df2=pd.read_csv(excel_file_path)
df2_uniques =df2['Ticker'].unique()
for value in df2_uniques:
value=value.replace(' ', '_')
df=pd.read_csv(excel_file_path,index_col="Ticker")
rows=df.loc[[value]] #Changed from 'value' to value
print(type(rows))
rows.to_csv(r'_'+value+'.csv')
#Removed export_path as filename and filepath together were giving me hard time to figure out.
#The files get saved in same filepath as the original imported filepath. So that'll do. sharing just for reference
Final output looks like this:

I can't know for sure without seeing the dataframe, but the error indicates that there is no column name 'Ticker'. It appears that you set this column to be the index, so you can try df2_uniques = set(df2.index).

changed
rows=df.loc[['value']]
to
rows=df.loc[[value]]
Also, Removed export_path as both filename and filepath together were giving me hard time to figure out.
The files get saved in same filepath as the original imported filepath. So that'll do. Sharing just for reference
Final code that worked looked like this:
import pandas as pd
excel_file_path=r'C:\Users\mahan\Documents\test projects\01_07_APR_WEEKLY_expiry_data_VEGE_NF_AND_BNF_Options_Desktop_Vege.csv'
df2=pd.read_csv(excel_file_path)
df2_uniques =df2['Ticker'].unique()
for value in df2_uniques:
value=value.replace(' ', '_')
df=pd.read_csv(excel_file_path,index_col="Ticker")
rows=df.loc[[value]] #Changed from 'value' to value
print(type(rows))
rows.to_csv(r'_'+value+'.csv')

pandas read_excel doesn't read all rows

I have a problem with "pandas read_excel", thats my code:
import pandas as pd
df = pd.read_excel('myExcelfile.xlsx', 'Table1', engine='openpyxl', header=1)
print(df.__len__())
If I run this code in Pycharm on Windows PC I got the right length of the dataframe, which is 28757
but if I run this code on my linux server I got only 26645 as output.
Any ideas whats the reason for that?
Thanks

Try this way:
import pandas as pd
data= pd.read_excel('Advertising.xlsx')
data.head()

I got the solution.
The problem was an empty first row in my .xlsx File.
My file is automatically created by another program, so I used openpyxl to delete the first row and make a new .xlsx File.
import openpyxl
path = 'myExcelFile.xlsx'
book = openpyxl.load_workbook(path)
sheet = book['Tabelle1']
#start at row 0, length 1 row:
sheet.delete_rows(0,1)
#save in new file:
book.save('myExcelFile_new.xlsx')
Attention, in this code sample I don`t check if the first row is empty!
So I delete the first line no matter if there is content in it or not.

How do I prevent pandas from writing a new column when I save to csv

I wrote this code just so show the example that I'm having. I need to save the data I have to a csv then reopen it later but when I reload the data into a pandas dataframe from csv it now has an extra unnamed column at the front that I don't want and it's messing up my data when I try to do .drop_duplicates() because each row now has its own number and every I reopen it from a csv it will have a new row of number at the front, just making everything worse. How do I make it so it doesn't have this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
df.to_csv('data.csv')
print(df.head())
df1 = pd.read_csv('data.csv')
print(df1.head())

Its the dataframe index. You can turn that off with
df.to_csv('data.csv', index=False)
The docs are the first stop to learn the different options you have when writing. pandas.DataFrame.to_csv

While reading, you can prevent columns with empty rows like:
df = pd.read_csv("data.csv").dropna()

The solution was super easy. I needed to do
df.to_csv('data.csv', index= False)

columns missing while reading a huge csv file in jupyter nootebook

So i am trying to read a csv file by the code as below:
import pandas as pd
user_cols = ['id','listing_type','status','listing_class','property_type','street_address','city','state',' 'zip_4','cross_street','street_index','unit','floor','location','Latitude',
'longitude','subway','neighborhood','price','incentives','fee_type','fee_percentage','fee_details_broker',
'fee_details_clients','application_information','maintenance','taxes','max_financing','other_costs','beds',
'baths','full_baths','three_quarter_baths','half_baths','total_rooms','square_feet','exterior_square_feet',
'lot_area','lot_dimensions','date_available','date_listed','closed_on','year_built','recent_renovation',
'lease_min','lease_max','date_added','date_edited','date_update','contact','access','keys','mls_name','mls_id',
'courtesy_of','vow_opt_out','idx_opt_out','pet_details','notes','sync','private','listing_score','added_by_id',
'featured_office_id','date_expires','exclusive_file_id','condition','guarantor','blast_link']
data = pd.read_csv("C:\\Users\\Desktop\\dump-4.csv", low_memory=False, dtype=object, header=None, names=user_cols)
I am able to read the file but when i try to display the columns there are about 15-16 column names that are missing. Why is this happening and what can I do.

So when i deleted the dtype=object and header=None..it did print all the columns. not really sure what wouldve been the correct dtype though! Thanks anyway! :)

saving a dataframe to csv file (python)

I am trying to restructure the way my precipitations' data is being organized in an excel file. To do this, I've written the following code:
import pandas as pd
df = pd.read_excel('El Jem_Souassi.xlsx', sheetname=None, header=None)
data=df["El Jem"]
T=[]
for column in range(1,56):
liste=data[column].tolist()
for row in range(1,len(liste)):
liste[row]=str(liste[row])
if liste[row]!='nan':
T.append(liste[row])
result=pd.DataFrame(T)
result
This code works fine and through Jupyter I can see that the result is good
screenshot
However, I am facing a problem when attempting to save this dataframe to a csv file.
result.to_csv("output.csv")
The resulting file contains the vertical index column and it seems I am unable to call for a specific cell.
(Hopefully, someone can help me with this problem)
Many thanks !!

It's all in the docs.
You are interested in skipping the index column, so do:
result.to_csv("output.csv", index=False)
If you also want to skip the header add:
result.to_csv("output.csv", index=False, header=False)
I don't know how your input data looks like (it is a good idea to make it available in your question). But note that currently you can obtain the same results just by doing:
import pandas as pd
df = pd.DataFrame([0]*16)
df.to_csv('results.csv', index=False, header=False)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas "usecols" doesn't seems work perfectly - python

In the file you have two columns named "PIA IM Equity" when reading pandas will rename the second identical column by adding .1 in the name. If you would have a third column with the same name it would have added .2 in it.

Related

Pandas error: "None of [Index(['value'], dtype='object', name='Ticker')] are in the [index]"

pandas read_excel doesn't read all rows

How do I prevent pandas from writing a new column when I save to csv

columns missing while reading a huge csv file in jupyter nootebook

saving a dataframe to csv file (python)

Categories

Resources