'combine_first' in pandas produces NA error - python

I have two dataframes, each with a series of dates as the index. The dates do not overlap (in other words, one date range runs from, say, 2013-01-01 through 2016-06-15 by month, and the second DataFrame starts on 2016-06-15 and runs quarterly through 2035-06-15).
Most of the column names overlap (i.e. are the same) and the join works just fine. However, there is one column in each DataFrame that I would like to preserve as 'belonging' to the original DataFrame so that I have them both available for future use, so I gave each a different name. For example, DF1 has a column entitled opselapsed_time and DF2 has a column entitled constructionelapsed_time.
When I try to combine DF1 and DF2 together using the command DF1.combine_first(DF2) or vice versa I get this error: ValueError: Cannot convert NA to integer.
Could someone please give me advice on how best to resolve this?
Do I need to just stick with using a merge/join type solution instead of combine_first?

Found the best solution:
pd.concat([test.construction, test.ops], join='outer')
(In older pandas this function also lived at pd.tools.merge.concat; current versions expose it only as pd.concat.)
This concatenates along the date index and keeps both of the differently named columns. For the column names the frames share, the rows simply stack; join='outer' keeps the union of column names, while join='inner' would keep only the shared ones.
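A minimal sketch of the idea, with made-up frame names, dates and a shared cash_flow column standing in for the real monthly/quarterly data:
import pandas as pd

# Two frames with non-overlapping date indexes: one shared column
# (cash_flow, an assumed name) plus one uniquely named column each.
ops = pd.DataFrame(
    {'cash_flow': [1.0, 2.0], 'opselapsed_time': [1, 2]},
    index=pd.to_datetime(['2013-01-01', '2013-02-01']))
construction = pd.DataFrame(
    {'cash_flow': [3.0, 4.0], 'constructionelapsed_time': [3, 6]},
    index=pd.to_datetime(['2016-06-15', '2016-09-15']))

combined = pd.concat([ops, construction], join='outer')
# cash_flow stacks across both date ranges; each *elapsed_time column
# survives, padded with NaN where the other frame has no rows.
print(combined)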

Related

Group By and ILOC Errors

I'm getting the following error when trying to group by and sum a dataframe by specific columns.
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
I've checked other solutions and it's not a double column name header issue.
See df3 below, which I want to group by on all columns except the last two, which I want to sum().
The head of dfs shows that if I just spell out the column names the groupby works fine, but not with iloc, which I understood to be the correct way to pull back the columns I want to group by.
I need to use iloc because the final dataframe will have many more columns.
df3.iloc[:, 0:3] returns a dataframe, so you are trying to group one dataframe by another dataframe. But groupby just needs a list of column names.
Can you try this:
dfs = df3.groupby(list(df3.iloc[:, 0:3].columns))[['Churn_Alive_1', 'Churn_Alive_0']].sum()
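A minimal sketch of that pattern; the first three column names here are invented, only the two summed columns come from the question:
import pandas as pd

df3 = pd.DataFrame({
    'region': ['N', 'N', 'S', 'S'],
    'product': ['a', 'a', 'b', 'b'],
    'channel': ['web', 'web', 'shop', 'shop'],
    'Churn_Alive_1': [1, 2, 3, 4],
    'Churn_Alive_0': [5, 6, 7, 8],
})

# Pull out the first three column *names*, not the sliced frame itself.
keys = list(df3.iloc[:, 0:3].columns)  # or simply list(df3.columns[0:3])
dfs = df3.groupby(keys)[['Churn_Alive_1', 'Churn_Alive_0']].sum()
print(dfs)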

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, only when I do that the merged data doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header. The second df is a 'key' document to match the first output with its correct id/tastant/etc., and they share the same date/subject/procedure/etc. columns.
And this is my code that's trying to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
with output where the id, ts and tastant columns should match up with the first dataframe but don't.
Check your dtypes, make sure they match between the 2 dataframes. Pandas makes assumptions about data types when it imports, it could be assuming numbers are int in one dataframe and object in another.
For the string columns, check for additional whitespace. It can appear in datasets, and since you can't see it but pandas can, it results in no match. You can remove it with df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
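A sketch of both checks on toy frames (a deliberately shortened key list, with a stray trailing space planted in one key to show the failure mode):
import pandas as pd

# Stand-ins for the two frames; note the stray space in second's 'subject'.
first = pd.DataFrame({'subject': ['s1', 's2'], 'trial': ['1', '2']})
second = pd.DataFrame({'subject': ['s1 ', 's2'], 'trial': ['1', '2'],
                       'id': ['a', 'b']})

keys = ['subject', 'trial']
print(first[keys].dtypes)   # confirm the key dtypes agree between frames
print(second[keys].dtypes)

# Normalize the string keys before merging.
for col in keys:
    first[col] = first[col].astype(str).str.strip()
    second[col] = second[col].astype(str).str.strip()

combined = first.merge(second, on=keys, how='left')
print(combined)  # 'id' now carries over instead of NaN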

Quandl + Python: The date column not "working"

I am trying to get some data through the Quandl API, but the date column doesn't seem to behave like the other columns. E.g. when I use the following code:
data = quandl.get("WIKI/KO", trim_start="2000-12-12", trim_end="2014-12-30", authtoken=quandl.ApiConfig.api_key)
print(data['Open'])
I end up with the below result
Date
2000-12-12 57.69
2000-12-13 57.75
2000-12-14 56.00
2000-12-15 55.00
2000-12-18 54.00
I.e. the dates appear along with the 'Open' column. And when I try to directly include Date like this:
print(data[['Open','Date']]),
it says Date doesn't exist as a column. So I have two questions: (1) How do I make Date an actual column and (2) How do I select only the 'Open' column (and thus not the dates).
Thanks in advance
Why does print(data['Open']) show dates even though Date is not a column?
quandl.get returns a pandas DataFrame whose index is a DatetimeIndex.
Thus, to access the dates you would use data.index instead of data['Date'].
(1) How do I make Date an actual column
If you wish to make the DatetimeIndex into a column, call reset_index:
data = data.reset_index()
print(data[['Open', 'Date']])
(2) How do I select only the 'Open' column (and thus not the dates)
To obtain a NumPy array of values without the index, use data['Open'].values.
(All pandas Series and DataFrames have indexes; that's pandas' raison d'ĂȘtre! So the only way to obtain the values without the index is to convert the Series or DataFrame to a different kind of object, such as a NumPy array.)
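A minimal sketch of both answers without touching the Quandl API, rebuilding the question's sample by hand:
import pandas as pd

idx = pd.to_datetime(['2000-12-12', '2000-12-13', '2000-12-14'])
data = pd.DataFrame({'Open': [57.69, 57.75, 56.00]},
                    index=pd.Index(idx, name='Date'))

print(data.index)             # the dates live in the index, not in a column
data = data.reset_index()     # (1) promote the index to a real 'Date' column
print(data[['Open', 'Date']])
print(data['Open'].values)    # (2) a plain NumPy array, no index attached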

Python Pandas Dataframes comparison on 2 columns (with where clause)

I'm stuck on a particular Python question here. I have 2 dataframes, DF1 and DF2. In both, I have 2 columns, pID and yID (which are not indexed, just default). I'm looking to add a column Found to DF1 marking where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to restrict the check to just the values in DF2 where aID == 'Text'.
I believe the below gets me the 1st part of this question; however, I'm unsure how as to incorporate the where.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe to the rows where aID == 'Text', giving a reduced DF from which to take just the columns to be compared against the first dataframe.
Use DF.isin() to check whether the values present under these column names match. Then .all(axis=1) returns True for a row only if both columns are True. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DFs used:
df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))
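With these demo frames, rows 0 and 4 are the only ones where both pID and yID line up with a df2 row whose aID is 'Text', so Found comes out as [1, 0, 0, 0, 1].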
If it does not matter where those matches occur, another option is to merge the two subsets on the common 'pID' and 'yID' columns as the key, temporarily promoting df1's index to a column with reset_index so you can recover which rows of df1 took part in the merge.
Access those indices, which indicate where matches were found, assign the value 1 to a new column named Found, and fill its missing elements with 0s throughout.
matches = df1_sub.reset_index().merge(df2_sub, on=['pID', 'yID'])
df1.loc[matches['index'], 'Found'] = 1
df1['Found'] = df1['Found'].fillna(0)
df1 will have been modified accordingly after the above steps.

Finding rows which aren't in two dataframes

I have the following dataframe:
symbol  name
abc     Jumping Jack
xyz     Singing Sue
rth     Fat Frog
I then have another dataframe with the same structure (symbol + name). I need to output all the symbols which are in the first dataframe but not the second.
The name column is allowed to differ. For example I could have symbol = xyz in both dataframes but with different names. That is fine. I am simply trying to get the symbols which do not appear in both dataframes.
I am sure this can be done using pandas merge and then outputting the rows that didn't merge, but I just can't seem to get it right.
Use isin and negate the condition using ~:
df[~df['symbol'].isin(df1['symbol'])]
This will return the rows where 'symbol' is present in your first df but not in the other df.
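Since the question mentions a merge-based route, the same rows also fall out of a left merge with indicator=True; a sketch using the sample data (with a hypothetical second frame df1):
import pandas as pd

df = pd.DataFrame({'symbol': ['abc', 'xyz', 'rth'],
                   'name': ['Jumping Jack', 'Singing Sue', 'Fat Frog']})
df1 = pd.DataFrame({'symbol': ['xyz'], 'name': ['Different Name']})

# indicator=True adds a _merge column recording where each row matched.
merged = df.merge(df1[['symbol']], on='symbol', how='left', indicator=True)
print(merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge'))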
