How to correctly read a specific CSV column - python

Hey everyone, my question is kinda silly, but I am new to Python.
I am writing a Python script for a C# application and I ran into a strange issue when working with a CSV document.
When I open it and work with the Date column, it works fine:
df=pd.read_csv("../Debug/StockHistoryData.csv")
df = df[['Date']]
But when I try to work with other columns, it throws an error:
df = df[['Close/Last']]
KeyError: "None of [Index(['Close/Last'], dtype='object')] are in the [columns]"
It says there is no such index, but when I print the whole table it works fine and shows all the columns.

Take a look at the first row of your CSV file.
It contains column names (comma-separated).
From time to time it happens that this initial line contains a space after
each comma.
For a human being it is quite readable and even intuitive.
But read_csv keeps these spaces as part of the column names, which is
sometimes difficult to discover.
Another option is to run print(df.columns) after you read your file.
Then look for any extra spaces in column names.
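A minimal sketch of that situation, assuming a header with a space after each comma (the CSV text below is made up for illustration and is not the asker's file). It shows the stray spaces in the column names and two possible ways to get rid of them: stripping the names after reading, or passing skipinitialspace=True to read_csv.

```python
import io
import pandas as pd

# Hypothetical CSV text: note the space after each comma in the header.
csv_text = "Date, Close/Last, Volume\n01/02/2020, 10.5, 100\n"

df = pd.read_csv(io.StringIO(csv_text))
raw_columns = list(df.columns)
print(raw_columns)  # every name except the first starts with a space

# Option 1: strip whitespace from the column names after reading.
df.columns = df.columns.str.strip()
close_after_strip = df[["Close/Last"]]

# Option 2: let the parser skip the space that follows each delimiter.
df2 = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
close_skip = df2[["Close/Last"]]
```

Either option makes df[['Close/Last']] work; which one fits better depends on whether the data cells also carry leading spaces.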

Related

Why Doesn't Python Recognize the Column Name (KeyError)

I imported stock/options data into a data frame and want to use pandas to manually filter for specific criteria. I renamed a few columns and then later on I tried to do a bit of cleaning so I can work with the data.
I tried to replace percentage signs then convert the data type to a float by doing this:
df = df['IV'].str.rstrip("%").astype(float)
df = df['IV_Rank'].str.rstrip("%").astype(float)/100
df = df['IV PCT'].str.rstrip("%").astype(float)/100
When I run that code I get the error message: KeyError: 'IV'. I got this error for the other columns as well when I ran each line independently, even after copying and pasting the column names and trying the old names. I am not sure what to do, so some help would be appreciated.
That's because you are overwriting the entire dataframe: after the first line, df is a Series holding only the converted values, so the column lookups that follow raise KeyError. This is what I think you are trying to do:
df['IV'] = df['IV'].str.rstrip("%").astype(float)
df['IV_Rank'] = df['IV_Rank'].str.rstrip("%").astype(float)/100
df['IV PCT'] = df['IV PCT'].str.rstrip("%").astype(float)/100
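To see the difference, here is a small sketch on made-up sample values (only the column names come from the question). It first reproduces the failure by keeping the result of the first conversion in a separate name, then applies the column-wise assignment.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "IV": ["25%", "40%"],
    "IV_Rank": ["10%", "55%"],
    "IV PCT": ["5%", "90%"],
})

# What "df = df['IV']..." produces: a Series, so the other columns are gone.
overwritten = df["IV"].str.rstrip("%").astype(float)
try:
    overwritten["IV_Rank"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Assigning back to a column keeps the DataFrame intact.
df["IV"] = df["IV"].str.rstrip("%").astype(float)
for col in ["IV_Rank", "IV PCT"]:
    df[col] = df[col].str.rstrip("%").astype(float) / 100
```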

Pandas read_csv only returns the first column when column names are duplicated

I have OHLC data in a .csv file with the stock name repeated in the header rows, like this:
M6A=F, M6A=F,M6A=F, M6A=F, M6A=F
Open, High, Low, Close, Volume
I am using pandas read_csv to get it, and parse all (and only) the 'M6A=F' columns to FastAPI. So far nothing I do will get all the columns. I either get the first column if I filter with "usecols=" or the last column if I filter with "names=".
I don't want to load the entire .csv file and then discard the unwanted data, for speed reasons, so I need to filter before extracting the data.
Here is my code example:
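import json
import pandas as pd
symbol = ['M6A=F']
# Keep only the columns whose header matches the symbol
df = pd.read_csv('myOHCLVdata.csv', skipinitialspace=True, usecols=lambda x: x in symbol)
def parse_csv(df):
    # Convert the frame to a list of row dicts
    res = df.to_json(orient="records")
    parsed = json.loads(res)
    return parsed
#app.get("/test")
def historic():
    return parse_csv(df)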
What I have done so far:
I checked the documentation for pandas.read_csv and it says "names=" will not allow duplicates.
I use lambdas in the above code to prevent the symbol hanging FastAPI if it does not match a column.
My understanding from other Stack Overflow questions on this is that mangle_dupe_cols=True should rename the duplicates to M6A=F.1, M6A=F.2, M6A=F.3, etc. when pandas reads the file into a dataframe, but that isn't happening. I tried setting it to False, but it says that is not implemented yet.
And answers like the one I found in this Stack Overflow solution don't seem to tally with what is happening in my code, since I am only getting the first column returned, or the last column with the others overwritten. (I included the FastAPI code here as it might be related to the issue or a workaround.)

Way to refer to a column with the same name under different merged cells?

I'm kinda new to pandas and stuck on how to refer to a column when the same column name appears under different merged header cells. Here is an example of the problem. I want to get the Worker data for Company C, but if I load this Excel file as df and run
dfcompanyAworker=df[Worker]
it won't work.
Is there a specific way to select data from identically named columns like this?
Here's the table:
https://i.stack.imgur.com/8Y6gp.png
Thanks!
First read the dataset that will be used and describe its layout to the reader. For example, for an Excel file:
dfcompanyAworker = pd.read_excel('Worker', skiprows=1, header=[1,2], index_col=0, skipfooter=7)
dfcompanyAworker
where:
skiprows=1 ignores the title row in the data
header=[1, 2] is a list because we have multilevel columns, namely Category (Company) and other data
index_col=0 makes the Date column the index for easier processing and analysis
skipfooter=7 ignores the footer at the end of the data
You can follow or adapt these steps for your own file.
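Once the header is read as two levels, a column is addressed by a (company, field) pair. The sketch below builds a small frame in memory instead of reading the Excel file; the company names, the "Salary" field, and all values are assumptions for illustration, since the original table is only available as an image.

```python
import pandas as pd

# Hypothetical two-level header: company on top, field underneath.
columns = pd.MultiIndex.from_product(
    [["Company A", "Company C"], ["Worker", "Salary"]]
)
df = pd.DataFrame(
    [[10, 100, 7, 70], [12, 120, 9, 90]],
    index=["2021-01", "2021-02"],
    columns=columns,
)

# One specific column: pass the (top level, second level) pair as a tuple.
company_c_worker = df[("Company C", "Worker")]

# The same field for every company: take a cross-section on the second level.
all_worker = df.xs("Worker", axis=1, level=1)
```

Here company_c_worker is a Series, while all_worker is a DataFrame with one column per company.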

How to put the dates in chronological order in python CSV

So I am working on processing some data and after running my file, everything is fine except that the dates are not in order. I used the code below to try to put them in order, but it didn't work. By the way, 'updated_at' is the column I am trying to put in chronological order.
df = df.sort_values(by=["updated_at"], ascending=True)
Please let me know how I can make this work. I have attached a picture for a better understanding of my question.
"updated_at" column pic
We are missing a bit of the context, but it could be that the column "updated_at" is not a datetime column but a plain string column, since it looks to me like it is sorted alphabetically. Check with df["updated_at"].dtype, and if it's not a datetime type (it will probably be of type "object"), convert it before sorting:
df["updated_at"] = pd.to_datetime(df["updated_at"])
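A small sketch of the effect, using made-up dates in an assumed month/day/year format (the real format in the asker's file is not known). Sorting the strings orders them character by character; converting first gives a chronological order.

```python
import pandas as pd

# Hypothetical values; the "%m/%d/%Y" format is an assumption.
df = pd.DataFrame({"updated_at": ["10/02/2021", "02/15/2021", "09/30/2020"]})

# Sorting while the column still holds strings compares text, not dates.
sorted_as_text = df.sort_values(by="updated_at")["updated_at"].tolist()

# Convert to datetime, then sort again to get chronological order.
df["updated_at"] = pd.to_datetime(df["updated_at"], format="%m/%d/%Y")
df = df.sort_values(by="updated_at", ascending=True)
chronological = df["updated_at"].dt.strftime("%Y-%m-%d").tolist()
```

If the format is not known in advance, pd.to_datetime can also be called without the format argument, at the cost of letting pandas guess it.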

Now getting this new Exception: cannot handle a non-unique multi-index

Similar questions have been asked, but I didn't get it to work, so here goes. (This used to work fine until recently, so I don't know whether a Python update broke it.)
"data" is a pandas DataFrame.
data = data[data['plant name'].notnull()]
data = data[data['plant name'] != '0']
When I use my full range of data I get the "Exception: cannot handle a non-unique multi-index!". This may be related to several columns having empty names in the Excel file I read this from (in Spyder, when I look at the data, these have "NaN" for a name). I thought this would work as long as I don't have several columns with the name I'm filtering by (in this case 'plant name'). But if I reduce the data to only include the first 3 columns, it doesn't raise the exception.
The second problem is that these 2 lines of code used to remove all rows where the 'plant name' was '0' or empty. They don't do anything anymore (even if I stop them from crashing by removing the columns mentioned above).
Thank you!
