New to Pandas, so maybe I'm missing a big idea?
I have a Pandas DataFrame of register transactions with shape like (500,4):
Time datetime64[ns]
Net Total float64
Tax float64
Total Due float64
I'm working through my code in a Python3 Jupyter notebook. I can't get past sorting any column. Working through the different code examples for sort, I'm not seeing the output reorder when I inspect the df. So, I've reduced the problem to trying to order just one column:
df.sort_values(by='Time')
# OR
df.sort_values(['Total Due'])
# OR
df.sort_values(['Time'], ascending=True)
No matter which column title, or which boolean argument I use, the displayed results never change order.
Thinking it could be a Jupyter thing, I've previewed the results using print(df), df.head(), and HTML(df.to_html()) (the last example is for Jupyter notebooks). I've also rerun the whole notebook from import CSV to this code. And, I'm also new to Python3 (from 2.7), so I get stuck with that sometimes, but I don't see how that's relevant in this case.
Another post has a similar problem, Python pandas dataframe sort_values does not work. In that instance, the ordering was on a column type string. But as you can see all of the columns here are unambiguously sortable.
Why does my Pandas DataFrame not display new order using sort_values?
df.sort_values(['Total Due']) returns a sorted DF, but it doesn't update DF in place.
So do it explicitly:
df = df.sort_values(['Total Due'])
or
df.sort_values(['Total Due'], inplace=True)
My problem, fyi, was that I wasn't returning the resulting dataframe, so PyCharm wasn't bothering to update said dataframe. Naming the dataframe after the return keyword fixed the issue.
Edit:
I had return at the end of my method instead of
return df,
which the debugger must of noticed, because df wasn't being updated in spite of my explicit, in-place sort.
Related
I am using pandas 1.0.1 and I am creating a new column that converts the date column to a datetime column and I am getting the warning below. I tried using data.loc[:, "Datetime"] as well and I still got the same warning. Please how could this be avoided?
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data["Datetime"] = pd.to_datetime(data["Date"], infer_datetime_format=True)
Most likely you created your source DataFrame as a view of another
DataFrame (only some columns and / or only some rows).
Find in your code the place where your DataFrame is created and append .copy() there.
Then your DataFrame will be created as a fully independent DataFrame (with its
own data buffer) and this warning should not appear any more.
import pandas as pd
DATA = pd.read_csv(url)
DATA.head()
I have a large dataset that have dozens of columns. After loading it like above into Colab, I can see the name of each column. But running DATA.columns just return Index([], dtype='object'). What's happening in this?
Now I find it impossible to pick out a few columns without column names. One way is to specify names = [...] when I load it, but I'm reluctant to do that since there're too many columns. So I'm looking for a way to index a column by integers, like in R df[:,[1,2,3]] would simply give me the first three columns of a dataframe. Somehow Pandas seems to focus on column names and makes integer indexing very inconvenient, though.
So what I'm asking is (1) What did I do wrong? Can I obtain those column names as well when I load the dataframe? (2) If not, how can I pick out the [0, 1, 10]th column by a list of integers?
It seems that the problem is in the loading as DATA.shape returns (10000,0). I rerun the loading code a few times, and all of a sudden, things go back normal. Maybe Colab was taking a nap or something?
You can perfectly do that using df.loc[:,[1,2,3]] but i would suggest you to use the names because if the columns ever change the order or you insert new columns, the code can break it.
Python script does not run while executing in a bat file, but runs seamlessly on the editor.
The error is related to datatype difference in pd.merge script. Although the datatype given to both the columns is same in both the dataframes.
df2a["supply"] = df2a["supply"].astype(str)
df2["supply_typ"] = df2["supply_typ"].astype(str)
df2a["supply_typ"] = df2a["supply_typ"].astype(str)
df = (pd.merge(df2,df2a, how=join,on=
['entity_id','pare','grome','buame','tame','prd','gsn',
'supply','supply_typ'],suffixes=['gs2','gs2x']))
While running the bat file i am getting following error in pd.merge:
You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat
Not a direct answer, but contains code that cannot be formatted in a comment, and should be enough to solve the problem.
When pandas says that you are trying to merge on float64 and object columns, it is certainly right. It may not be evident because pandas relies on numpy, and that a numpy object column can store any data.
I ended with a simple function to diagnose all those data type problem:
def show_types(df):
for i,c in enumerate(df.columns):
print(df[c].dtype, type(df.iat[0, i]))
It shows both the pandas datatype of the columns of a dataframe, and the actual type of the first element of the column. It can help do see the difference between columns containing str elements and other containing datatime.datatime ones, while the datatype is just objects.
Use that on both of your dataframes, and the problem should become evident...
New to Pandas, so maybe I'm missing a big idea?
I have a Pandas DataFrame of register transactions with shape like (500,4):
Time datetime64[ns]
Net Total float64
Tax float64
Total Due float64
I'm working through my code in a Python3 Jupyter notebook. I can't get past sorting any column. Working through the different code examples for sort, I'm not seeing the output reorder when I inspect the df. So, I've reduced the problem to trying to order just one column:
df.sort_values(by='Time')
# OR
df.sort_values(['Total Due'])
# OR
df.sort_values(['Time'], ascending=True)
No matter which column title, or which boolean argument I use, the displayed results never change order.
Thinking it could be a Jupyter thing, I've previewed the results using print(df), df.head(), and HTML(df.to_html()) (the last example is for Jupyter notebooks). I've also rerun the whole notebook from import CSV to this code. And, I'm also new to Python3 (from 2.7), so I get stuck with that sometimes, but I don't see how that's relevant in this case.
Another post has a similar problem, Python pandas dataframe sort_values does not work. In that instance, the ordering was on a column type string. But as you can see all of the columns here are unambiguously sortable.
Why does my Pandas DataFrame not display new order using sort_values?
df.sort_values(['Total Due']) returns a sorted DF, but it doesn't update DF in place.
So do it explicitly:
df = df.sort_values(['Total Due'])
or
df.sort_values(['Total Due'], inplace=True)
My problem, fyi, was that I wasn't returning the resulting dataframe, so PyCharm wasn't bothering to update said dataframe. Naming the dataframe after the return keyword fixed the issue.
Edit:
I had return at the end of my method instead of
return df,
which the debugger must of noticed, because df wasn't being updated in spite of my explicit, in-place sort.
The fillna method of pandas dataframe is behaving in a weird way. I have a dataframe where the columns are of type datetime.date() and the indices are of type datetime.time(). On each cell I have values, and sometimes I have "nan".
Now the indices have a frequency of minutes, but some minutes are missing. For example, the time 7:18 does not exist in my dataframe. Now when I run the method fillna(method='ffill') on the dataframe, it generates the 7.17 and fills it with an absurd value.
UPDATE
In the new dataframe once the fillna is done, actually the 7.17 does not appear in the index, but if you look for it explicitly (eg using .loc), it has a value, which is plotted in the .plot() option...
I have also build my own fillna function manually, basically it goes column by column and fills the na with the previous value. In the debugging, when it arrives to 7.16 it jumps to 7.18 like nothing has happened, because there is not "nan" in 7.16, neither in 7.18. But the result has the same weird value at 7.17, which does not appear in the dataframe unless you ask for it explicitly or you plot it.
Note that when I use fillna(method='bfill') the issue is not happening...
UPDATE 2
More things I discovered: if I save the dataframe once I have done the fillna to a csv, and then load it again, the strange value does no appear at 7.17. But if I parse the dates when I load the csv to a dataframe, then it does. Is not explicitly in the dataframe, but it appears when you plot it.
Anyone knows why is that happening?