How to combine dataframes based on index column name - python

Hello, I am new to Python. I have 2 dfs and a list of tickers, and I would like to combine the 2 dfs based on the list of tickers. My second df had the tickers imported from an Excel sheet, so the column names are in a different order; I am not sure if that changes anything.
df1 looks like:

df1
index  ABC  DEF  XYZ
avg      2    6   12
std      1    2    3
var     24   25   35
max     56   66   78

df2 looks like:

df2
index    10    40    96
ticker   XYZ   ABC   DEF
Sector   Auto  Tech  Mining
I would like to combine them, based on their ticker names, into a third df with all the information, so it looks something like this:
df3
index   ABC   DEF     XYZ
avg     2     6       12
std     1     2       3
var     24    25      35
max     56    66      78
Sector  Tech  Mining  Auto
I have tried this:
df3 = pd.concat([df1, df2], ignore_index=True)
but it made a df where they were side by side instead of one combined df. Any help would be appreciated.

You need to set the index on df2 so that its rows line up with df1's columns:
# move the 'index' column into the index, transpose so the tickers become a
# column, promote 'ticker' to the index, then transpose back
df2 = df2.set_index('index').T.set_index('ticker').T
out = pd.concat([df1,df2])
          ABC     DEF   XYZ
index
avg         2       6    12
std         1       2     3
var        24      25    35
max        56      66    78
Sector   Tech  Mining  Auto
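
For reference, a minimal self-contained sketch of the whole flow, with both frames rebuilt from the tables in the question:

import pandas as pd

# rebuild the example frames from the question
df1 = pd.DataFrame(
    {"ABC": [2, 1, 24, 56], "DEF": [6, 2, 25, 66], "XYZ": [12, 3, 35, 78]},
    index=["avg", "std", "var", "max"],
)
df2 = pd.DataFrame(
    {"index": ["ticker", "Sector"],
     10: ["XYZ", "Auto"], 40: ["ABC", "Tech"], 96: ["DEF", "Mining"]},
)

# align df2's columns with df1's ticker columns, then stack the frames
df2 = df2.set_index("index").T.set_index("ticker").T
df3 = pd.concat([df1, df2])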

Related

Python Pandas Multiindexing select rows that match all values in a list

Consider the following data frame:
import pandas as pd

df = pd.DataFrame()
df['Folder'] = [2,3,4,5, 2,4,5, 2,3,4, 2,3,4,5,1]
df['Country'] = ['USA','USA','USA','USA', 'Mexico','Mexico','Mexico', 'UK','UK','UK', 'Canada','Canada','Canada','Canada','Canada']
df['Data'] = [20,30,43,15, 25,44,15, 26,37,47, 24,34,47,55,18]
df.set_index(['Country','Folder'], drop=True, inplace=True)
df
                Data
Country Folder
USA     2         20
        3         30
        4         43
        5         15
Mexico  2         25
        4         44
        5         15
UK      2         26
        3         37
        4         47
Canada  2         24
        3         34
        4         47
        5         55
        1         18
How do I select the rows for the countries whose Folder level contains all of lst = [1,3,4]?
                Data
Country Folder
Canada  2         24
        3         34
        4         47
        5         55
        1         18
OR
                Data
Country Folder
Canada  3         34
        4         47
        1         18
Either would work for me; I want to know that Canada matches all of lst. lst may be up to 8 items long.
I have tried df.query("Folder in @lst"), however that returns rows matching any of lst. I need rows matching all of lst.
Thanks in advance for any help.
Use GroupBy.transform to convert each group's Folder values to a set, and use issubset to get all groups whose Folder values contain every element of lst:
lst=[1,3,4]
f = lambda x: set(lst).issubset(set(x.index.get_level_values('Folder')))
mask = df.groupby('Country')['Data'].transform(f)
df1 = df[mask]
print (df1)
                Data
Country Folder
Canada  2         24
        3         34
        4         47
        5         55
        1         18
Finally, if you need only the matched values, filter on the Folder level:
df2 = df1[df1.index.isin(lst, level='Folder')]
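
A variant of the same issubset idea (a sketch, not from the original answer): compute the matching countries once, then select all of their rows.

# countries whose Folder values cover every element of lst
folders = df.index.to_frame(index=False)
ok = folders.groupby('Country')['Folder'].apply(lambda s: set(lst).issubset(s))
matching = ok[ok].index

df1 = df[df.index.get_level_values('Country').isin(matching)]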
This works for me in pandas 1.0.4. Which pandas version do you use?
df.query('Country == "Canada" and Folder in [1,3,4]')
or
l = [1,3,4]
c = 'Canada'
df.query('Country == @c and Folder in @l')
>>>
                Data
Country Folder
Canada  3         34
        4         47
        1         18
This is an alternative to @jezrael's approach, where we group on the boolean values from isin together with the country:
In [38]: (df.groupby([df.index.isin([1,3,4], level='Folder'),
                      df.index.get_level_values('Country')])
            .filter(lambda x: len(x) == 3))
Out[38]:
                Data
Country Folder
Canada  3         34
        4         47
        1         18
This takes advantage of the fact that there are three numbers in the list: a country that matches all of them contributes exactly 3 rows to its group. To get all of a matching country's rows, you could chunk the steps:
mapping = df.index.isin([1,3,4], level='Folder')
filtered = (pd.Series(mapping)
              .groupby(df.index.get_level_values('Country'))
              .transform(lambda x: sum(x) >= 3))
In [61]: df.loc[filtered.array]
Out[61]:
                Data
Country Folder
Canada  2         24
        3         34
        4         47
        5         55
        1         18
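
Since the Folder values are unique within each country here, the hard-coded 3 can be replaced by len(lst), so the same sketch works for lists of any length (up to the 8 items mentioned in the question):

lst = [1, 3, 4]
filtered = (pd.Series(df.index.isin(lst, level='Folder'))
              .groupby(df.index.get_level_values('Country'))
              .transform(lambda x: x.sum() >= len(lst)))
df.loc[filtered.array]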

Pandas: While adding new rows, it's replacing my existing dataframe values? [duplicate]

This question already has answers here:
Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
(4 answers)
Closed 2 years ago.
import pandas as pd
data = {'term':[2,7,10,11,13], 'pay':[22,30,50,60,70]}
df = pd.DataFrame(data)

   pay  term
0   22     2
1   30     7
2   50    10
3   60    11
4   70    13
df.loc[2] = [49,9]
print(df)

   pay  term
0   22     2
1   30     7
2   49     9
3   60    11
4   70    13
Expected output:

   pay  term
0   22     2
1   30     7
2   49     9
3   50    10
4   60    11
5   70    13
If I run the code above, it replaces the values at index 2. I want to add a new row with the desired values to my existing dataframe without replacing the existing values. Please suggest.
You cannot insert a new row directly by assigning to df.loc[2], as that overwrites the existing values. But you can slice the dataframe into two parts and then concat the two parts with the new row in between.
Try this:
new_df = pd.DataFrame({"pay": 49, "term": 9}, index=[2])
df = pd.concat([df.loc[:1], new_df, df.loc[2:]]).reset_index(drop=True)
print(df)
Output:

   term  pay
0     2   22
1     7   30
2     9   49
3    10   50
4    11   60
5    13   70
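
Note that .loc slicing is label-based and inclusive of both endpoints, so df.loc[:1] takes rows 0 and 1 while df.loc[2:] takes rows 2 onward: nothing is dropped or duplicated.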
A possible way is to prepare an empty slot in the index, add the row, and sort by the index:
df.index = list(range(2)) + list(range(3, len(df) + 1))
df.loc[2] = [49,9]
It gives:

   term  pay
0     2   22
1     7   30
3    10   50
4    11   60
5    13   70
2    49    9
Time to sort it:
df = df.sort_index()

   term  pay
0     2   22
1     7   30
2    49    9
3    10   50
4    11   60
5    13   70
That is because the loc and iloc methods address rows that already exist in the dataframe, so assigning to an existing label overwrites it; what you would normally do is insert by appending a row at the end.
To address this situation, first you need to split the dataframe, append the row you want, concatenate with the second split, and finally reset the index (in case you want to keep using integers):
# location where you want to insert
i = 2
# data to insert
data_to_insert = pd.DataFrame({'term': 49, 'pay': 9}, index=[i])
# split before i (loc slicing is inclusive, hence :i-1), append the data to
# insert, then append the rest of the original
# (note: DataFrame.append was removed in pandas 2.0; use pd.concat there)
df = df.loc[:i-1].append(data_to_insert).append(df.loc[i:]).reset_index(drop=True)
Keep in mind that the slice operator will work because the index is integers.
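
For reuse, the same split-and-concatenate idea can be wrapped in a small helper (a sketch; insert_row is a hypothetical name, and a default RangeIndex is assumed):

import pandas as pd

def insert_row(df, pos, row):
    # split by integer position with iloc, put the new row in between,
    # then rebuild a clean 0..n-1 index
    new = pd.DataFrame(row, index=[pos])
    return pd.concat([df.iloc[:pos], new, df.iloc[pos:]]).reset_index(drop=True)

df = insert_row(df, 2, {'term': 49, 'pay': 9})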

Pandas Multiindex get values from first entry of index

I have the following multiindex dataframe:
from io import StringIO
import pandas as pd
datastring = StringIO("""File,no,runtime,value1,value2
A,0, 0,12,34
A,0, 1,13,34
A,0, 2,23,34
A,1, 6,23,38
A,1, 7,22,38
B,0,17,15,35
B,0,18,17,35
C,0,34,23,32
C,0,35,21,32
""")
df = pd.read_csv(datastring, sep=',')
df.set_index(['File','no',df.index], inplace=True)
>>> df
           runtime  value1  value2
File no
A    0  0        0      12      34
        1        1      13      34
        2        2      23      34
     1  3        6      23      38
        4        7      22      38
B    0  5       17      15      35
        6       18      17      35
C    0  7       34      23      32
        8       35      21      32
What I would like to get is just the first value2 of every (File, no) combination:
A 0 34
A 1 38
B 0 35
C 0 32
The most similar questions I could find were these:
Resample pandas dataframe only knowing result measurement count
MultiIndex-based indexing in pandas
Select rows in pandas MultiIndex DataFrame
but I was unable to construct a solution from them. The best I got was an IndexSlice operation, but as the values are technically still there (just not displayed), something like
idx = pd.IndexSlice
df.loc[idx[:,0],:]
can filter for no == 0 but still returns the entire rest of the dataframe.
Is a multiindex even the right tool for the task at hand? How to solve this?
Use GroupBy.first, grouping by the first and second levels of the MultiIndex:
s = df.groupby(level=[0,1])['value2'].first()
print(s)

File  no
A     0     34
      1     38
B     0     35
C     0     32
Name: value2, dtype: int64
If you need a one-column DataFrame, use a one-element list:
df1 = df.groupby(level=[0,1])[['value2']].first()
print(df1)

         value2
File no
A    0       34
     1       38
B    0       35
C    0       32
Another idea is to remove the third level with DataFrame.reset_index and keep the first row of each (File, no) pair with Index.duplicated and boolean indexing:
df2 = df.reset_index(level=2, drop=True)
s = df2.loc[~df2.index.duplicated(), 'value2']
print(s)

File  no
A     0     34
      1     38
B     0     35
C     0     32
Name: value2, dtype: int64
For the sake of completeness, I would like to add another method (which I would not have found without the answer by jezrael):
s = df.groupby(level=[0,1])['value2'].nth(0)
This can be generalized to finding any entry, not merely the first:
t = df.groupby(level=[0,1])['value1'].nth(1)
Note that the selection was changed from value2 to value1, as for the former the results of nth(0) and nth(1) would have been identical.
Pandas documentation link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html
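
One subtlety worth noting (not raised in the answers): GroupBy.first skips NaN values per column, whereas nth(0) returns the literal first row, NaN included; on this data the two coincide. A minimal illustration:

s = pd.Series([None, 1.0],
              index=pd.MultiIndex.from_tuples([('A', 0), ('A', 0)]))
s.groupby(level=[0, 1]).first()  # 1.0 -- first() skips the leading NaN
s.groupby(level=[0, 1]).nth(0)   # NaN -- the literal first entry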

DataFrame transform column values to new columns

I have the following series:
project  id         type
First    130403725  PRODUCT    68
                    EMPTY       2
Six      130405706  PRODUCT    24
         132517244  PRODUCT    33
         132607436  PRODUCT    87
How can I transform the type values into new columns:
                    PRODUCT  EMPTY
project  id
First    130403725       68      2
Six      130405706       24      0
         132517244       33      0
         132607436       87      0
This is a classic pivot table:
df_pivoted = df.pivot(index=["project", "id"], columns=["type"], values=[3])
I've used 3 as the name of the value column, but it would be clearer if you had named it.
Use unstack, because this is a MultiIndex Series:
s1 = s.unstack(fill_value=0)
print(s1)

type                EMPTY  PRODUCT
project  id
First    130403725      2       68
Six      130405706      0       24
         132517244      0       33
         132607436      0       87
For a DataFrame:
df = s.unstack(fill_value=0).reset_index().rename_axis(None, axis=1)
print(df)

  project         id  EMPTY  PRODUCT
0   First  130403725      2       68
1     Six  130405706      0       24
2     Six  132517244      0       33
3     Six  132607436      0       87
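
A self-contained sketch of the unstack flow, with the series rebuilt from the question:

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('First', 130403725, 'PRODUCT'), ('First', 130403725, 'EMPTY'),
     ('Six', 130405706, 'PRODUCT'), ('Six', 132517244, 'PRODUCT'),
     ('Six', 132607436, 'PRODUCT')],
    names=['project', 'id', 'type'])
s = pd.Series([68, 2, 24, 33, 87], index=idx)

# unstack pivots the innermost level ('type') into the columns
df = s.unstack(fill_value=0).reset_index().rename_axis(None, axis=1)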

Iteratively Capture Value Counts in Single DataFrame

I have a pandas dataframe that looks something like this:
id  group  gender  age_grp  status
1   1      m       over21   active
2   4      m       under21  active
3   2      f       over21   inactive
I have over 100 columns and thousands of rows. I am trying to create a single pandas dataframe of the value_counts of each of the columns. So I want something that looks like this:
                  group1
gender  m            100
        f             89
age     over21        98
        under21       11
status  active        87
        inactive      42
Anyone know a simple way I can iteratively concat the value_counts from each of the 100+ columns in the original dataset while capturing the column names as a hierarchical index?
Eventually I want to be able to merge with another dataframe of a different group, to look like this:
                  group1  group2
gender  m            100      75
        f             89      92
age     over21        98      71
        under21       11      22
status  active        87      44
        inactive      42      13
Thanks!
This should do it:
df.stack().groupby(level=1).value_counts()
id       1           1
         2           1
         3           1
group    1           1
         2           1
         4           1
gender   m           2
         f           1
age_grp  over21      2
         under21     1
status   active      2
         inactive    1
dtype: int64
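
To get the group1/group2 layout from the end of the question, the same value_counts can be computed per group and concatenated column-wise (a sketch, assuming a second dataframe df2 holding group 2's rows):

counts1 = df.stack().groupby(level=1).value_counts().rename('group1')
counts2 = df2.stack().groupby(level=1).value_counts().rename('group2')  # df2 is hypothetical
out = pd.concat([counts1, counts2], axis=1).fillna(0).astype(int)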
