There are similar questions already asked, but I couldn't get their solutions to work, so here goes. (This used to work fine until recently, so I don't know whether a Python update broke it or not.)
"data" is a pandas DataFrame.
data = data[data['plant name'].notnull()]
data = data[data['plant name'] != '0']
When I use my full range of data I get "Exception: cannot handle a non-unique multi-index!". This may be related to several columns having empty names in the Excel file I read this from (in Spyder, when I inspect the data, these columns show "NaN" as their name). I thought this would work as long as I don't have several columns with the name I'm filtering by (in this case 'plant name'). But if I reduce the data to only include the first 3 columns, I don't get the exception.
The second problem is that these two lines of code used to remove all rows where 'plant name' was '0' or empty. They don't do anything anymore (even when I stop them from crashing by removing the columns mentioned above).
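(For reference, a minimal sketch of the idea with made-up toy data, not the real workbook: dropping the NaN-named columns first sidesteps the duplicate-label problem, and comparing against '0' as a string also catches numeric zeroes.)
import numpy as np
import pandas as pd

# Toy frame standing in for the Excel import; one column comes in with no
# header, i.e. a NaN column name (purely illustrative, not the asker's data).
data = pd.DataFrame({'plant name': ['A', '0', None, 'B'], 'output': [1, 2, 3, 4]})
data[np.nan] = 0

# Drop the unnamed (NaN-named) columns first, so later label-based lookups
# don't have to deal with NaN or duplicate column labels.
data = data.loc[:, data.columns.notna()]

# The original filters, comparing as strings in case the zeroes are numeric.
data = data[data['plant name'].notnull()]
data = data[data['plant name'].astype(str) != '0']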
Thank you!
I have OHLC data in a .csv file where the stock name is repeated in the header rows, like this:
M6A=F, M6A=F,M6A=F, M6A=F, M6A=F
Open, High, Low, Close, Volume
I am using pandas read_csv to load it and want to parse all (and only) the 'M6A=F' columns to FastAPI. So far nothing I do will get all the columns: I either get only the first column if I filter with "usecols=", or only the last column if I filter with "names=".
I don't want to load the entire .csv file and then dump the unwanted data, for performance reasons, so I need to filter before extracting the data.
Here is my code example:
import json
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
symbol = ['M6A=F']
df = pd.read_csv('myOHCLVdata.csv', skipinitialspace=True, usecols=lambda x: x in symbol)

def parse_csv(df):
    res = df.to_json(orient="records")
    parsed = json.loads(res)
    return parsed

@app.get("/test")
def historic():
    return parse_csv(df)
What I have done so far:
I checked the documentation for pandas.read_csv and it says "names=" will not allow duplicates.
I use a lambda in the code above to prevent FastAPI from hanging if the symbol does not match any column.
My understanding from other Stack Overflow questions on this is that mangle_dupe_cols=True should rename the duplicates to M6A=F.1, M6A=F.2, M6A=F.3, etc. when pandas reads the file into a dataframe, but that isn't happening. I tried setting it to False, but it says that is not implemented yet.
Answers like the one I found in this Stack Overflow solution don't seem to tally with what is happening in my code, since I only get the first column returned, or the last column with the others overwritten. (I included the FastAPI code here as it might be related to the issue or to a workaround.)
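One possible approach, sketched under the assumption that the file really has the two header rows shown above (a ticker row, then Open/High/Low/Close/Volume), is to read both rows as a column MultiIndex and slice one ticker out of it:
import pandas as pd

# Treat the first two lines as a column MultiIndex:
# level 0 = ticker (M6A=F), level 1 = field (Open/High/Low/Close/Volume).
df = pd.read_csv('myOHCLVdata.csv', header=[0, 1], skipinitialspace=True)

# Pull out every column that belongs to one ticker; the result keeps
# Open/High/Low/Close/Volume as its column names.
m6a = df.xs('M6A=F', axis=1, level=0)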
The problem
I'm working with a data set, which has been given to me as a csv file with lines of the form id,data. I would like to work with this data in a pandas dataframe, with the id as the index.
Unfortunately, somewhere along the data pipeline, my csv file has acquired a number of rows where the ids are missing. Fortunately, the rows of my data are not fully independent, so I can recreate the missing values: each row is linked to its predecessor, and I have access to an oracle which, when given an id, can give me all its data. This includes the id of its predecessor.
My question is therefore whether there's a simple way of filling in these missing values in my dataframe.
My Solution
I don't have much experience working with pandas, but after playing around for a bit I came up with the following approach. I start by reading the csv file into a dataframe without setting the index, so I end up with a RangeIndex. I then
Find the location of the rows with missing ids
Add 1 to the index to get the children of each row
Ask the oracle for the parents of each child
Merge the parents and children on the child id
Subtract one from the index again, and set the parent ids
In code:
# Rows just after a missing id: their own ids identify the "children".
children = df.loc[df[df['id'].isna()].index + 1, 'id']
# The oracle gives each child's parent id, keyed by the child's own id.
parents = pd.Series({x.id: x.parent_id for x in ask_oracle(children)},
                    name='parent_id')
# Attach each parent id to the matching child row.
combined = pd.merge(children, parents, left_on='id', right_index=True)
# Shift back up one row so the parent ids line up with the rows missing them.
combined.set_index(combined.index - 1, inplace=True)
df.loc[combined.index, 'id'] = combined['parent_id']
This works, but I'm 95% sure it's going to look like scary black magic in a few months' time.
In particular, I'm unhappy with
The way I get the location of the NaN rows; three lots of df[ in one line is just too many
The manual fiddling about with the indices I have to do to get the rows to match up.
Does anyone have any suggestions for a better way of doing things?
The format of the input data is fixed, as are the properties of the oracle, but if there's a smarter way of organising my dataframe, I'm more than happy to hear it.
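For comparison, one possible less index-heavy sketch of the same idea (untested against the real oracle; it assumes ask_oracle accepts any iterable of ids and that the last row's id is never missing):
# Boolean mask of the rows whose id is missing.
missing = df['id'].isna()

# shift(-1) aligns each row's successor id with the row itself;
# keep only the positions where an id is missing.
child_ids = df['id'].shift(-1)[missing]

# One oracle call for all children, then a child id -> parent id lookup.
parent_of = {x.id: x.parent_id for x in ask_oracle(child_ids)}

# Write the parent ids back into the gaps (index alignment does the matching).
df.loc[missing, 'id'] = child_ids.map(parent_of)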
Hey everyone, my question may be a silly one, but I am new to Python.
I am writing a Python script for a C# application and I have run into a rather strange issue when working with a CSV document.
When I open it and work with the Date column, it works fine:
df=pd.read_csv("../Debug/StockHistoryData.csv")
df = df[['Date']]
But when I try to work with other columns, it throws an error:
df = df[['Close/Last']]
KeyError: "None of [Index(['Close/Last'], dtype='object')] are in the [columns]"
It says there is no such index, but when I print the whole table it works fine and shows all the columns.
[Table image]
[Error image]
Take a look at the first row of your CSV file.
It contains column names (comma-separated).
From time to time it happens that this initial line contains a space after each comma.
For a human being it is quite readable and even intuitive.
But read_csv adds these spaces to the column names, which is sometimes difficult to discover.
Another way to check is to run print(df.columns) after you read your file.
Then look for any extra spaces in column names.
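A minimal sketch of the two usual fixes, reusing the file path from the question (either one should be enough):
import pandas as pd

# Option 1: let the parser drop the space that follows each comma in the header.
df = pd.read_csv("../Debug/StockHistoryData.csv", skipinitialspace=True)

# Option 2: strip stray whitespace from the column names after loading.
df.columns = df.columns.str.strip()

print(df[['Close/Last']].head())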
import pandas as pd
DATA = pd.read_csv(url)
DATA.head()
I have a large dataset that has dozens of columns. After loading it into Colab as above, I can see the name of each column. But running DATA.columns just returns Index([], dtype='object'). What's happening here?
Now I find it impossible to pick out a few columns without column names. One way is to specify names = [...] when I load it, but I'm reluctant to do that since there are too many columns. So I'm looking for a way to index columns by integers, like in R, where df[, c(1, 2, 3)] would simply give me the first three columns of a dataframe. Somehow Pandas seems to focus on column names and make integer indexing very inconvenient, though.
So what I'm asking is: (1) What did I do wrong? Can I obtain those column names as well when I load the dataframe? (2) If not, how can I pick out, say, the 0th, 1st, and 10th columns with a list of integers?
It seems that the problem is in the loading, as DATA.shape returns (10000, 0). I reran the loading code a few times and, all of a sudden, things went back to normal. Maybe Colab was taking a nap or something?
You can do that with df.iloc[:, [0, 1, 2]] (iloc selects by integer position, loc by label), but I would suggest using the names, because if the columns ever change order or you insert new columns, the code can break.
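For example, a tiny illustrative sketch (made-up data, not the asker's frame):
import pandas as pd

# Toy frame; the asker's real data has dozens of columns.
DATA = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# iloc selects by integer position, regardless of column names.
subset = DATA.iloc[:, [0, 1, 2]]

# For the asker's case it would be DATA.iloc[:, [0, 1, 10]],
# assuming the frame has at least 11 columns.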
I'm new to Pandas, so this is a basic question. I created a DataFrame by concatenating two previous DataFrames. I used
todo_pd = pd.concat([rabia_pd, capitan_pd], keys=['Rabia','Capitan'])
thinking that in the future I could separate them easily and save each one to a different location. Right now I'm unable to do this separation using the keys I defined with the concat function.
I've tried simple things like
half_dataframe = todo_pd['Rabia']
but it throws me an error saying that there is a problem with the key.
I've also tried other options I've found on SO, like using the _get_values('Rabia') or the .index._get_level_values('Rabia') features, but they all throw different errors, saying either that a string is not recognized as a way to access the information, or that a positional argument 'level' is required.
The whole DataFrame contains about 22 columns, and I just want to retrieve from the "big dataframe" the part indexed as 'Rabia' and the part indexed as 'Capitan'.
I'm sure it has a simple solution that I'm not getting for my lack of practice with Pandas.
Thanks a lot,
Use DataFrame.xs:
df1 = todo_pd.xs('Rabia')
df2 = todo_pd.xs('Capitan')
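A tiny self-contained illustration of the round trip (toy data, not the asker's 22-column frames):
import pandas as pd

rabia_pd = pd.DataFrame({'col': [1, 2]})
capitan_pd = pd.DataFrame({'col': [3, 4]})

todo_pd = pd.concat([rabia_pd, capitan_pd], keys=['Rabia', 'Capitan'])

# xs drops the selected key level, giving back each original frame.
df1 = todo_pd.xs('Rabia')
df2 = todo_pd.xs('Capitan')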