New to pandas - I have been trying to use pandas.merge_asof to join two datasets: first on a shared ID, then on the nearest timestamp to the timestamp in df1.
The issue I have discovered is that both left_on and right_on must be int. I have one column that contains NaNs, and they must remain. Floats were also ineffective. From my research on Stack Overflow, I found that the latest version of pandas, 0.24, has this functionality: you simply convert the column to Int64. However, the version of pandas available at work is 0.23.x and cannot be upgraded at this time.
What is my easiest option? If I were to simply remove the rows associated with the NaN values in that one column, could I add them back later and then change the dtype back from int to object? Would this disrupt anything?
I tried two methods:
1) I set the NaNs to -1 (there was no ID of -1 in the other dataset), then put them back to NaN afterwards.
2) I removed the records associated with the NaNs in that column, and put the records back in afterwards.
I tried to compare the results (after resetting the indices and sorting by timestamp), but I kept getting False. Both should give the same result regardless.
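For reference, a minimal sketch of what method 1 might look like, using hypothetical column names 'id' (the shared key that contains NaNs) and 'time' (the timestamp merged on):

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"id": [1.0, 2.0, np.nan],
                    "time": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"])})
df2 = pd.DataFrame({"id": [1, 2],
                    "time": pd.to_datetime(["2019-01-01", "2019-01-02"]),
                    "val": [10, 20]})

# Replace the NaN IDs with a sentinel that occurs in neither dataset,
# cast to int so the by-columns match, then restore the NaNs after the merge.
df1["id"] = df1["id"].fillna(-1).astype(int)

merged = pd.merge_asof(df1.sort_values("time"), df2.sort_values("time"),
                       on="time", by="id")
merged["id"] = merged["id"].replace(-1, np.nan)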
For the life of me, I can't figure out how to combine these two dataframes. I am using the newest versions of all the software involved, including Python, pandas, and Dask.
# pandasframe has 10k rows and 3 columns -
['monkey', 'banana', 'furry']
# daskframe has 1.5m rows, 1 column, 135 partitions -
row.index: 'monkey_banana_furry'
row.mycolumn = 'happy flappy tuna'
My dask dataframe has a string as its index for accessing,
but when I do daskframe.loc[index_str] it returns a dask dataframe, while I thought it was supposed to return one single specific row, and I don't know how to access the row/value I need from that dataframe. What I want is to input the index and get back one specific value.
what am i doing wrong?
Even pandas.DataFrame.loc doesn't return a scalar if you don't specify a label for the columns.
Anyway, to get a scalar in your case, you first need to call dask.dataframe.DataFrame.compute to get a pandas dataframe (since dask.dataframe.DataFrame.loc returns a dask dataframe), and only then can you use the pandas .loc.
Assuming dfd is your dask dataframe, try this:
dfd.loc[index_str].compute().loc[index_str, "mycolumn"]
Or this :
dfd.loc[index_str, "mycolumn"].compute().iloc[0]
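For completeness, a small self-contained sketch (the data here is made up, and the column is assumed to be called "mycolumn" as in the question):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"mycolumn": ["grumpy cat", "happy flappy tuna"]},
                   index=["dog_bone_smooth", "monkey_banana_furry"])
dfd = dd.from_pandas(pdf, npartitions=2)

# .loc on a dask dataframe is lazy and returns another dask dataframe;
# .compute() materialises it as a pandas object, whose .loc then yields the scalar.
value = dfd.loc["monkey_banana_furry"].compute().loc["monkey_banana_furry", "mycolumn"]
print(value)  # happy flappy tuna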
I have not seen such a question before, so if you happen to know the answer or have seen the same question, please let me know.
I have a dataframe in pandas with 4 columns and 5k rows. One of the columns is "price" and I need to do some manipulations with it. But the data was parsed from a web page and it is not clean, so I cannot convert this column to an integer type after getting rid of the dollar signs and commas. I found out that it also contains data in the format 3500/mo, so I need to filter the cells containing /mo and decide whether I can drop them, based on how many of them there are and what the prices are.
Now, I have managed to count those cells using
df["price"].str.contains("/").sum()
But when I want to see those cells, I cannot: when I create another variable to extract the slash-containing cells using "contains" or something similar, I get a Series of True/False values showing whether each cell does or does not contain the slash, while I actually need to see the cells themselves. Any ideas?
You need to use the boolean mask returned by df["price"].str.contains("/") as an index to get the respective rows, i.e., df[df["price"].str.contains("/")] (cf. the pandas docs on indexing).
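A minimal sketch of that, with made-up prices resembling the ones described:

import pandas as pd

df = pd.DataFrame({"price": ["$1,200", "$3,500/mo", "$950", "2,000/mo"]})

mask = df["price"].str.contains("/")  # boolean Series: True where the cell has a slash
print(mask.sum())   # how many such cells there are
print(df[mask])     # the slash-containing rows themselves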
Comparing:
df.loc[:,'col1']
df.loc[:,['col1']]
Why does (2) create a DataFrame, while (1) creates a Series?
In principle, when it's a list, it can be a list of more than one column name, so it's natural for pandas to give you a DataFrame, because only a DataFrame can host more than one column. However, when it's a string, pandas can safely say it refers to exactly one column, so giving you a Series isn't a problem. Take the two formats and two outcomes as a reasonable flexibility to get whichever you need, a Series or a DataFrame; sometimes you specifically need one of the two.
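A quick illustration of the difference:

import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

print(type(df.loc[:, "col1"]))    # <class 'pandas.core.series.Series'>
print(type(df.loc[:, ["col1"]]))  # <class 'pandas.core.frame.DataFrame'>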
I have this table:
And I just want all the rows that match the first row, so I wrote
df[(df['product']==df.loc[[0],['product']])&(df['program_code']==df.loc[[0],['program_code']])]
etc. for all the other columns that aren't sum.
Which should return the first ~30 rows
Instead I get Must pass DataFrame with boolean values only
If I check to see whether they're boolean (which you would think they are, as I'm comparing the values to themselves and they're homogeneous), I get:
It's like my dataframe somehow shifted and I get two NaNs. I'm sure there's a feature for which this shifting is important, but I don't even know when I'm doing it.
And even if you solve that, I get:
But if I HAND type it, I get
Success!
So maybe the item isn't right.
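For what it's worth, a small sketch of the difference, using a made-up frame in place of the table above: df.loc[[0], ['product']] is a 1x1 DataFrame, so comparing a column against it triggers label alignment (hence the shifted NaNs), whereas df.loc[0, 'product'] is a plain scalar and keeps each comparison a boolean Series that df[...] accepts.

import pandas as pd

df = pd.DataFrame({"product": ["a", "a", "b"],
                   "program_code": [1, 1, 2],
                   "sum": [10, 20, 30]})

# Scalar lookups keep each comparison a boolean Series.
mask = ((df["product"] == df.loc[0, "product"])
        & (df["program_code"] == df.loc[0, "program_code"]))
print(df[mask])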
I'm tearing my hair out a bit with this one. I've imported two CSVs into pandas dataframes; both have a column called SiteReference, and I want to use pd.merge to join the dataframes using SiteReference as a key.
The initial merge failed because pd.read_csv took different interpretations of the SiteReference values, in one instance 380500145.0 and in the other 380500145, both stored as objects. I ran a regex to clean the columns and then pd.to_numeric; this resulted in one value of 380500145.0 and another of 3.805001e+10. They should both be 380500145. I then attempted:
df['SiteReference'] = df['SiteReference'].astype(int).astype('str')
But got back;
ValueError: cannot convert float NaN to integer
How can I control how pandas is dealing with these, preferably on import?
Perhaps the best solution is to prevent pd.read_csv from inferring the type of this field:
df=pd.read_csv('data.csv',sep=',',dtype={'SiteReference':str})
Following the discussion in the comments, if you want to format floats as integer strings, you can use this:
df['SiteReference'] = df['SiteReference'].map('{:,.0f}'.format)
This should handle null values gracefully.
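As a small sketch of that behaviour, with a made-up column containing a missing value:

import numpy as np
import pandas as pd

df = pd.DataFrame({"SiteReference": [380500145.0, np.nan]})

# .astype(int) raises "cannot convert float NaN to integer" here,
# whereas formatting simply renders the missing value as the string 'nan'.
df["SiteReference"] = df["SiteReference"].map("{:,.0f}".format)
print(df["SiteReference"].tolist())  # ['380,500,145', 'nan']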