Pandas convert from Int64Index to RangeIndex - python

I concatenated three dataframes. How can I print df.index in RangeIndex, instead of Int64Index?
My Input:
df = pd.concat([df1, df2, df3])
print(df.index)
My Output:
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            73809, 73810, 73811, 73812, 73813, 73814, 73815, 73816, 73817,
            73818],
           dtype='int64', length=495673)
Desired Output:
RangeIndex(start=X, stop=X, step=X)

You can use reset_index to get the desired index. For example:
df = pd.concat([df1,df2,df3])
df.index
Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2], dtype='int64')
After resetting the index:
df.reset_index(drop=True, inplace=True)
df.index
RangeIndex(start=0, stop=9, step=1)
Note that drop=True discards the old index instead of inserting it as a new column. It is also good practice to pass the axis keyword to concat explicitly.
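A minimal sketch of the full flow, assuming three small single-column frames (the names and data here are illustrative):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [4, 5, 6]})
df3 = pd.DataFrame({'a': [7, 8, 9]})

# axis=0 (row-wise) is the default, but passing it explicitly documents the intent
df = pd.concat([df1, df2, df3], axis=0)

# drop=True discards the old per-frame indices instead of keeping them as a column
df = df.reset_index(drop=True)
print(df.index)  # RangeIndex(start=0, stop=9, step=1)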

You can use the built-in ignore_index option:
df = pd.concat([df1, df2, df3], ignore_index=True)
print(df.index)
From the docs:
ignore_index : boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
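To illustrate the last sentence of that quote, here is a small sketch (the frames are illustrative): ignore_index only relabels the concatenation axis, while the other axis is still aligned by column name:
import pandas as pd

left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
right = pd.DataFrame({'b': [5, 6], 'c': [7, 8]})

out = pd.concat([left, right], ignore_index=True)
print(out)
#      a  b    c
# 0  1.0  3  NaN
# 1  2.0  4  NaN
# 2  NaN  5  7.0
# 3  NaN  6  8.0
print(out.index)  # RangeIndex(start=0, stop=4, step=1)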

Related

Compare multiple columns of two data frames using pandas

I have two data frames; df1 has Id and sendDate and df2 has Id and actDate. The two df's are not the same shape - df2 is a lookup table. There may be multiple instances of Id.
ex.
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
"actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
I want to add a boolean True/False in df1 to find when df1.Id == df2.Id and df1.sendDate == df2.actDate.
Expected output would add a column to df1:
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"],
"Match?": [True, False, False, False, True]})
I'm new to python from R, so please let me know what other info you may need.
Use isin and boolean indexing
import pandas as pd
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11",
"2018-01-06", "2018-01-06",
"2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
"actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
df1['Match'] = (df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate']))
print(df1)
Output:
   Id    sendDate  Match
0   1  2019-09-24   True
1   1  2020-09-11   True
2   2  2018-01-06  False
3   3  2018-01-06  False
4   2  2019-09-24   True
The .isin() approaches will find values where the ID and date entries don't necessarily appear together (e.g. Id=1 and date=2020-09-11 in your example). You can check for both by doing a .merge() and checking when df2's date field is not null:
df1['match'] = df1.merge(df2, how='left', left_on=['Id', 'sendDate'], right_on=['Id', 'actDate'])['actDate'].notnull()
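A minimal sketch using the question's frames, to show that the merge-based flag reproduces the expected output (unlike the plain .isin() checks):
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

# note: this assumes the (Id, actDate) pairs in df2 are unique; duplicates would add rows to the merge
merged = df1.merge(df2, how='left', left_on=['Id', 'sendDate'], right_on=['Id', 'actDate'])
df1['match'] = merged['actDate'].notnull()
print(df1)
#    Id    sendDate  match
# 0   1  2019-09-24   True
# 1   1  2020-09-11  False
# 2   2  2018-01-06  False
# 3   3  2018-01-06  False
# 4   2  2019-09-24   True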
A vectorized approach via numpy -
import numpy as np
df1['Match'] = np.where((df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate'])),True,False)
You can use .isin():
df1['id_bool'] = df1.Id.isin(df2.Id)
df1['date_bool'] = df1.sendDate.isin(df2.actDate)
Check out the documentation here.
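If you then want a single flag, you could combine the two columns, with the same caveat as the other .isin() answers that Id and date are checked independently rather than as a pair:
df1['Match'] = df1['id_bool'] & df1['date_bool']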

How to do point biserial correlation for multiple columns in one iteration

I am trying to calculate a point biserial correlation for a set of columns in my dataset. I am able to do it for an individual variable, but when I try to calculate it for all the columns in one loop, it throws an error.
Below is the code:
df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
from scipy import stats
corr_list = {}
y = df['A'].astype(float)
for column in df:
    x = df[['B','C','D']].astype(float)
    corr = stats.pointbiserialr(x, y)
    corr_list[['B','C','D']] = corr
print(corr_list)
TypeError: No loop matching the specified signature and casting was found for ufunc add
x must be a column, not a DataFrame. If you pass a single column instead of the whole DataFrame, it will work. You can try this:
df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
print(df)
from scipy import stats
corr_list = []
y = df['A'].astype(float)
for column in df:  # note: this loops over every column, including 'A' itself
    x = df[column]
    corr = stats.pointbiserialr(list(x), list(y))
    corr_list.append(corr[0])
print(corr_list)
By the way, you can use print(df.corr()), which gives you the correlation matrix of the DataFrame.
You can use the pd.DataFrame.corrwith() function:
df[['B', 'C', 'D']].corrwith(df['A'].astype('float'), method=stats.pointbiserialr)
The output lists the columns with their corresponding correlations and p-values (rows 0 and 1, respectively) against the target DataFrame or Series; see the pandas DataFrame.corrwith documentation:
B C D
0 4.547937e-18 0.400066 -0.094916
1 1.000000e+00 0.504554 0.879331
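If you prefer to stay with a plain loop but keep the column names attached, here is a small sketch (using the frame from the question) that collects one (correlation, p-value) pair per column:
from scipy import stats

results = {col: stats.pointbiserialr(df['A'], df[col]) for col in ['B', 'C', 'D']}
for col, (r, p) in results.items():
    print(col, r, p)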

pandas largest value per group with multi columns / why does it only work when flattening?

For a pandas dataframe of:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 1],
    'anomaly_score': [5, 10, 8, 100],
    'match_level_0': [np.nan, 1, 1, 1],
    'match_level_1': [np.nan, np.nan, 1, 1],
    'match_level_2': [np.nan, 1, 1, 1]
})
display(df)
df = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])
I want to calculate the largest rows per group.
df.columns = ['__'.join(col).strip() for col in df.columns.values]
df.groupby(['id'])['anomaly_score__mean'].nlargest(2)
This works, but it requires flattening the MultiIndex columns first.
Instead I want to directly use,
df.groupby(['id'])[('anomaly_score', 'mean')].nlargest(2)
But this fails with the key not being found.
Interestingly, it works just fine when not grouping:
df[('anomaly_score', 'mean')].nlargest(2)
For me, grouping the selected Series by the first level of the MultiIndex works, though it seems like a bug that it does not work the way you wrote it:
print (df[('anomaly_score', 'mean')].groupby(level=0).nlargest(2))
id match_level_0
1 1.0 55
2 1.0 8
Name: (anomaly_score, mean), dtype: int64
print (df[('anomaly_score', 'mean')].groupby(level='id').nlargest(2))
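An alternative sketch that sidesteps selecting the tuple key inside the groupby (assuming the aggregated df from the question): sort by the MultiIndex column first, then take the top rows of each id group. Note this keeps whole rows rather than just the mean column:
top2 = (df.sort_values(('anomaly_score', 'mean'), ascending=False)
          .groupby(level='id')
          .head(2))
print(top2)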

Multilevel index won't go away

I have a dataframe, which consists of summary statistics of another dataframe:
df = sample[['Place','Lifeexp']]
df = df.groupby('Place').agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values([('Lifeexp', 'count')], ascending=False)
When looking at the structure, the dataframe has a multi index, which makes plot creations difficult:
df.columns
MultiIndex(levels=[['Lifeexp', 'Place'], ['count', 'mean', 'max', 'min', '']],
labels=[[1, 0, 0, 0, 0], [4, 0, 1, 2, 3]])
I tried the solutions from different questions here (e.g. this one), but somehow I don't get the desired result. I want df to have Place, count, mean, max, min as column names, with Lifeexp gone, so that I can create simple plots, e.g. df.plot.bar(x="Place", y="count").
I think the solution is to simplify it: select the column right after the groupby, which prevents the MultiIndex in the columns:
df = df.groupby('Place')['Lifeexp'].agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values('count', ascending=False)
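If the frame with the MultiIndex columns already exists, a small sketch of an alternative: flatten the columns afterwards, keeping 'Place' from the first level and the statistic names from the second (the comprehension below is illustrative, not the only way):
# assumes df.columns looks like [('Place', ''), ('Lifeexp', 'count'), ('Lifeexp', 'mean'), ...]
df.columns = [second if second else first for first, second in df.columns]
df.plot.bar(x='Place', y='count')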

Remove a level from a MultiIndex

I need to remove a level (either by position or name) from a DataFrame's index and create a new DataFrame with the new index. The problem is that I end up having a non-unique index.
I had a look at Remove a level from a pandas MultiIndex, but the problem is that the use of unique(), as the answer there suggests, reduces the index to an array that doesn't retain the names of the levels.
Other than using unique(), and then creating a new Index by stitching the label names onto the array, is there a more elegant solution?
import numpy as np
import pandas as pd

index = [np.array(['foo', 'foo', 'qux']), np.array(['a', 'b', 'a'])]
data = np.random.randn(3, 2)
columns = ["X", "Y"]
df = pd.DataFrame(data, index=index, columns=columns)
df.index.names = ["Level0", "Level1"]
print(df)
                      X         Y
Level0 Level1
foo    a      -0.591649  0.831599
       b       0.049961 -1.524291
qux    a      -0.100124 -1.059195
index2 = df.reset_index(level=1, drop=True).index
df2 = pd.DataFrame(index=index2)
print(df2.loc[idx['foo'], :])
Empty DataFrame
Columns: []
Index: [foo, foo]
If I understand you correctly, you are looking for a solution to get the first level index without duplicated values. Your result should be an Index object without using unique and without explicitly creating the index again.
For your example data frame, you can use the following including get_level_values and drop_duplicates:
print(df.index.get_level_values(0).drop_duplicates())
Index(['foo', 'qux'], dtype='object', name='Level0')
Edit
For a more general solution either returning an Index or MultiIndex depending on the number of levels, you may use droplevel and drop_duplicates in conjunction:
print(df.index.droplevel(-1).drop_duplicates())
Index(['foo', 'qux'], dtype='object', name='Level0')
Here is the example from the linked SO post with 3 levels, which are reduced to a 2-level MultiIndex with unique values:
tuples = [(0, 100, 1000), (0, 100, 1001), (0, 100, 1002), (1, 101, 1001)]
index_3levels = pd.MultiIndex.from_tuples(tuples, names=["l1", "l2", "l3"])
print(index_3levels)
MultiIndex(levels=[[0, 1], [100, 101], [1000, 1001, 1002]],
labels=[[0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 2, 1]],
names=['l1', 'l2', 'l3'])
index2level = index_3levels.droplevel(-1).drop_duplicates()
print(index2level)
MultiIndex(levels=[[0, 1], [100, 101]],
labels=[[0, 1], [0, 1]],
names=['l1', 'l2'])
# show unique values of new index
print(index2level.values)
[(0, 100) (1, 101)]
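Applied back to the question's frame, a short sketch: building df2 on the deduplicated first-level index makes the lookup return a single row per label (any column data would still have to be selected or aggregated separately):
index2 = df.index.droplevel('Level1').drop_duplicates()
df2 = pd.DataFrame(index=index2)
print(df2.loc[['foo'], :])
# Empty DataFrame
# Columns: []
# Index: [foo]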
