Outer merging two dataframes where one contains a StringArray raises ValueError - python

I am trying to perform an outer (or left) merge on two dataframes. The second dataframe has an exclusive column with the "string" dtype (StringDtype), and the merge raises an error like ValueError: StringArray requires a sequence of strings or pandas.NA. I understand that I can use Series.astype(str) to cast the column to the "object" dtype, but the dataframe has many of these columns and that seems unnecessary. Is this a bug, or is there another workaround I'm not aware of?
Here is an example to recreate the error:
import pandas as pd
df1 = pd.DataFrame(dict(id=pd.Series([1], dtype=int)))
df2 = pd.DataFrame(dict(id=pd.Series([], dtype=int), first_name=pd.Series([], dtype="string")))
df_final = df1.merge(df2, on="id", how="outer") # "outer" can be replaced with "left" with the same effect
This should result in a DataFrame with a single row containing the id of 1 and a first_name of NA (or something similar); instead it raises the ValueError.
This works if I cast the "first_name" column to "object" with df2.first_name = df2.first_name.astype(str), but I'd like to avoid that, as I explained in the first paragraph.
Pandas version 1.0.5 is installed.
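This merge no longer raises on newer pandas releases, so upgrading may be the simplest fix. If you're stuck on 1.0.x, one workaround that avoids casting each column by hand is to select all string-dtype columns at once and cast them in one pass; a sketch (the select_dtypes-based bulk cast is my addition, not from the question):

```python
import pandas as pd

df1 = pd.DataFrame(dict(id=pd.Series([1], dtype=int)))
df2 = pd.DataFrame(dict(id=pd.Series([], dtype=int),
                        first_name=pd.Series([], dtype="string")))

# Cast every StringDtype column to object in one pass, so the merge
# no longer has to build a StringArray around the NaN fill values.
string_cols = df2.select_dtypes(include="string").columns
df2[string_cols] = df2[string_cols].astype(object)

df_final = df1.merge(df2, on="id", how="outer")
```

The result is one row with id 1 and a missing first_name, as expected.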

Related

How to maintain the same index for a dictionary from a dataframe

When I tried to run this code:
X_test = df.values
df_new = ks.DataFrame(X_test, columns = ['Sales','T_Year','T_Month','T_Week','T_Day','T_Hour'])
I am getting a new index for the df_new data frame, which is not the same as df's.
I tried changing the code as below to retain the index. However, it gives an error:
X_test = df.values(index=df.index)
'numpy.ndarray' object is not callable.
Is there a way to give df_new the same index as the df dataframe?
DataFrames have a set_index() method for manually setting the "index column". Koalas' version in particular accepts as its main argument:
keys: label or array-like or list of labels/arrays
This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index and np.ndarray.
By that, you can pass the Index object of your original df:
X_test = df.values
df_new = ks.DataFrame(X_test, columns = ['Sales','T_Year','T_Month','T_Week','T_Day','T_Hour'])
df_new = df_new.set_index(df.index)
Now, about the line that raises the error:
X_test = df.values(index=df.index)
The error arises because you are confusing numpy arrays with pandas DataFrames.
df.values is an attribute of the DataFrame df that returns a np.ndarray holding all the dataframe's values, without the index.
It is not a function, so you cannot "call it" by writing (index=df.index).
Numpy arrays don't have custom indexes; they are just plain arrays. That is all df_new receives, and you can set the index afterwards as I showed above.
Disclaimer: I wasn't able to install koalas for this answer, so this is only tested in pandas Dataframes. If koalas does support pandas' interface completely, that should work.
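To illustrate in plain pandas (with the same caveat that koalas wasn't tested here), the original index can also be supplied at construction time via the index= argument; whether the koalas constructor accepts index= the same way is an assumption, so set_index as above is the safer route there. The frame below is invented for illustration:

```python
import pandas as pd

# A toy stand-in for the original df, with a non-default index.
df = pd.DataFrame({'Sales': [1.0, 2.0], 'T_Year': [2020, 2021]},
                  index=['r1', 'r2'])

X_test = df.values  # plain np.ndarray; the index is lost at this point

# Rebuild the frame and hand the old index straight to the constructor.
df_new = pd.DataFrame(X_test, columns=['Sales', 'T_Year'], index=df.index)
```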

How can these two dataframes be merged on a specific key?

I have two dataframes, both with a column 'hotelCode' that is type string. I made sure to convert both columns to string beforehand.
The first dataframe, we'll call old_DF looks like so:
and the second dataframe new_DF looks like:
I have been trying to merge these unsuccessfully. I've tried
final_DF = new_DF.join(old_DF, on = 'hotelCode')
and get this error:
I've tried a variety of things: changing the index name, various merge/join/concat and just haven't been successful.
Ideally, I will have a new dataframe where you have columns [[hotelCode, oldDate, newDate]] under one roof.
DataFrame.join aligns against the other frame's index, so joining two frames on a shared column is more direct with pd.merge:
import pandas as pd
final_DF = pd.merge(old_DF, new_DF, on='hotelCode', how='outer')
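With hypothetical stand-ins for the frames shown in the question's screenshots (the values below are invented), this yields one row per hotelCode with both date columns side by side:

```python
import pandas as pd

# Invented data in place of the question's screenshots.
old_DF = pd.DataFrame({'hotelCode': ['A1', 'B2'],
                       'oldDate': ['2020-01-01', '2020-02-01']})
new_DF = pd.DataFrame({'hotelCode': ['A1', 'C3'],
                       'newDate': ['2021-01-01', '2021-03-01']})

# Outer merge keeps codes that appear in either frame.
final_DF = pd.merge(old_DF, new_DF, on='hotelCode', how='outer')
```

Codes present in only one frame get NaN in the other frame's date column; use how='inner' to keep only codes present in both.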

Can I get concat() to ignore column names and work only based on the position of the columns?

The docs, at least as of version 0.24.2, specify that pandas.concat can ignore the index with ignore_index=True, but:
Note the index values on the other axes are still respected in the join.
Is there a way to avoid this, i.e. to concatenate based on the position only, and ignoring the names of the columns?
I see two options:
rename the columns so they match, or
convert to numpy, concatenate in numpy, then from numpy back to pandas.
Are there more elegant ways?
For example, if I want to add the series s as an additional row to the dataframe df, I can:
convert s to a frame,
transpose it,
rename its columns so they are the same as those of df,
concatenate.
It works, but it seems very "un-pythonic"!
A toy example is below; this example is with a dataframe and a series, but the same concept applies with two dataframes.
import pandas as pd
df = pd.DataFrame()
df['a'] = [1]
df['x'] = 'this'
df['y'] = 'that'
s = pd.Series([3, 'txt', 'more txt'])
st = s.to_frame().transpose()
st.columns = df.columns
out = pd.concat([df, st], axis=0, ignore_index=True)
In the case of 1 dataframe and 1 series, you can do:
df.loc[df.shape[0], :] = s.values
(s.values drops the series' index, so only positions matter; the new row label df.shape[0] is the next integer when df has a default RangeIndex.)
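For the case of two dataframes with different column names, the "rename the columns" option from the question can be done in one step with set_axis, which relabels by position; a minimal sketch with invented frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1], 'x': ['this'], 'y': ['that']})
df2 = pd.DataFrame({'p': [3], 'q': ['txt'], 'r': ['more txt']})

# Relabel df2's columns to df1's by position, then concat normally.
df2_aligned = df2.set_axis(df1.columns, axis=1)
out = pd.concat([df1, df2_aligned], axis=0, ignore_index=True)
```

This assumes the two frames have the same number of columns in a compatible order; set_axis returns a relabeled copy, leaving df2 untouched.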

Inconsistent error when selecting a column whose heading is a datetime type in pandas

Need some help on a very odd pandas problem.
I have this dataframe df_raw whose column index is a mix of strings and datetime type. If I run this code:
df = df_raw
month = pd.datetime(2018,1,1)
assert month in df.columns # this always passes
df.loc[:,month]
It raises a KeyError, basically saying 1/1/2018 is not in the column index. But it obviously is; otherwise it wouldn't pass the assertion.
More confusingly, if I modify the first line a little bit:
df = df_raw.iloc[:, 0:]
month = pd.datetime(2018,1,1)
assert month in df.columns # this always passes
df.loc[:,month]
It gives the correct result, even though I didn't do anything different except re-select the entire dataframe (which also creates a copy instead of a view).
I've been working with similar dataframes whose column index is a mix of datetime and other types, and this is the first time I've encountered this problem. Can some Python guru enlighten me about the voodoo magic I'm dealing with?

Python Pandas Dataframes comparison on 2 columns (with where clause)

I'm stuck on a particular python question here. I have 2 dataframes, DF1 and DF2. In both, I have 2 columns, pID and yID (which are not indexed, just default). I'm looking to add a column Found to DF1 indicating whether the respective values of the columns (pID and yID) were found in DF2. Also, I would like to consider only the rows of DF2 where aID == 'Text'.
I believe the line below gets me the first part of this question; however, I'm unsure how to incorporate the where clause.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe to the rows containing aID == 'Text', and from that reduced DF select the columns to be compared against the first dataframe.
Use DF.isin() to check whether the values under these column names match. Then .all(axis=1) returns True for a row only if both columns are True for it. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DF's used:
df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))
If it matters where those matches occur, then merge the two dataframes on the common columns 'pID' and 'yID' as the key, passing right_index=True so the merge result carries an index you can align back to df1.
Access these indices, which indicate where matches were found, and assign the value 1 to a new column named Found, filling its missing elements with 0's throughout.
df1.loc[pd.merge(df1_sub, df2_sub, on=['pID', 'yID'], right_index=True).index, 'Found'] = 1
df1['Found'].fillna(0, inplace=True)
df1 should be modified accordingly after the above steps.
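Putting the pieces together with the demo frames above: note that when isin is given a DataFrame, it aligns on index and column labels, so a row of df1 is flagged only when the same row label survives the query and carries the same values.

```python
import pandas as pd

df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))

# Keep only the rows of df2 where aID == 'Text', then compare label-wise.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID == "Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
```

Here rows 0 and 4 match (aID is 'Text' and both pID and yID agree), row 2 has aID 'Text' but a different pID, and rows 1 and 3 are excluded by the query, so Found comes out as [1, 0, 0, 0, 1].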
