How to find the index column in a MultiIndex? - python

I want to change the first index level to integer type, e.g. 0.0 -> 0, 1.0 -> 1, 2.0 -> 2, and so on.
However, I can't access the first column on its own. As you can see, the index is a MultiIndex. Please help me.
I succeeded in accessing a single value using pandas indexing, but I don't know how to change the whole first level.
                         sum  count
timestamp(hour) goods  price  price
0.0             1       1000     40
                2        200     29
                3        129     11
                4         76      5
1.0             1       1000     40
                2        200     29
                3        129     11
                4         76      5
...
In [61]: pivot1.index[0][0]
Out[61]: 0.0

You can use DataFrame.rename with level=0:
import pandas as pd

df = pd.DataFrame({
    'col': [4, 5, 4, 5, 5, 4],
    'timestamp(hour)': [7, 8.0, 8, 8.0, 8, 3],
    'goods': list('aaabbb')
}).set_index(['timestamp(hour)', 'goods'])
print(df)
                       col
timestamp(hour) goods
7.0             a        4
8.0             a        5
                a        4
                b        5
                b        5
3.0             b        4
df = df.rename(int, level=0)
print(df)
                       col
timestamp(hour) goods
7               a        4
8               a        5
                a        4
                b        5
                b        5
3               b        4

You could:
df.index = df.index.set_levels([df.index.levels[0].astype(int), df.index.levels[1]])
But jezrael's answer is better, I guess.
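For reference, here is the set_levels approach as a self-contained sketch, reusing the example frame from the first answer (same assumed column names), with a dtype check at the end:

import pandas as pd

df = pd.DataFrame({
    'col': [4, 5, 4, 5, 5, 4],
    'timestamp(hour)': [7, 8.0, 8, 8.0, 8, 3],
    'goods': list('aaabbb')
}).set_index(['timestamp(hour)', 'goods'])

# replace level 0 with an integer version of itself; level 1 is kept as-is
df.index = df.index.set_levels(
    [df.index.levels[0].astype(int), df.index.levels[1]]
)
print(df.index.levels[0].dtype)  # int64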

Related

How to assign a column the value that is above it only if a condition is met?

So I have a dataframe where I have some empty values in a column. I need those empty values to be assigned the nearest real value above them, whether it is 1 row above or 4 rows above. The caveat is that I only need those empty values filled in if a certain condition is met.
Dataframe currently looks like:
Column A  Column B
1         100
1         NaN
1         NaN
2         150
2         NaN
2         NaN
3         NaN
3         NaN
4         60
5         70
5         NaN
I need it to look like:
Column A  Column B
1         100
1         100
1         100
2         150
2         150
2         150
3         NaN
3         NaN
4         60
5         70
5         70
So the first value for each group in Column A needs to be carried down for that group in Column B: all rows with a 1 in Column A should have the same Column B value, all rows with a 2 in Column A should have the same Column B value, and so on. The correct value is always the first one; in other words, the first row where a new value appears in Column A holds the Column B value that should be carried down.
I really have no idea how to approach this. I was thinking about using groupby, but that didn't make much sense to me.
I think groupby is the way to go:
g = df.groupby('Column A')
df['Column B'] = g['Column B'].ffill()
Output:
    Column A  Column B
0          1    100.00
1          1    100.00
2          1    100.00
3          2    150.00
4          2    150.00
5          2    150.00
6          3       NaN
7          3       NaN
8          4     60.00
9          5     70.00
10         5     70.00
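Since the value to carry down is always the first one in each group, a groupby transform is another option. A minimal sketch, assuming the column names and sample data from the question:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Column A': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5],
    'Column B': [100, np.nan, np.nan, 150, np.nan, np.nan,
                 np.nan, np.nan, 60, 70, np.nan],
})

# 'first' skips NaN, so each group gets its first real Column B value;
# an all-NaN group (like 3) simply stays NaN
df['Column B'] = df.groupby('Column A')['Column B'].transform('first')
print(df)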

Adding and multiplying values of a dataframe in Python

I have a dataset with multiple columns and rows. The rows are supposed to be summed up based on the unique value in a column. I tried .groupby, but I want to retain the whole dataset and not just the summed-up columns based on one unique column. I further need to multiply these individual columns (values) with another column.
For example:
id   A    B  C  D  E
11   2    1  2  4  100
11   2    2  1  1  100
12   1    3  2  2  200
13   3    1  1  4  190
14   NaN  1  2  2  300
I would like to sum up columns B, C & D based on the unique id and then multiply the result by columns A and E into a new column F. I do not want to sum up the values of columns A & E.
I would like the resultant dataframe to be something like this, which also deals with NaN by skipping it during the calculation:
id   A    B  C  D  E    F
11   2    3  3  5  100  9000
12   1    3  2  2  200  2400
13   3    1  1  4  190  2280
14   NaN  1  2  2  300  1200
If the above is unachievable, then I would like something like the following, where the rows are the same but the calculation is as described above, based on the same id:
id   A    B  C  D  E    F
11   2    3  3  5  100  9000
11   2    2  1  1  100  9000
12   1    3  2  2  200  2400
13   3    1  1  4  190  2280
14   NaN  1  2  2  300  1200
My logic earlier was to apply groupby on the columns B, C, D and then multiply, but that did not work out for me. If the above dataframes are unachievable, please let me know how I can perform this calculation and then merge/join the results with the original file with just the E column.
You must first sum columns B, C and D vertically for each common id, then take the horizontal product:
result = df.groupby('id').agg({'A': 'first', 'B': 'sum', 'C': 'sum', 'D': 'sum',
                               'E': 'first'})
result['F'] = result.fillna(1).astype('int64').agg('prod', axis=1)
It gives:
      A  B  C  D    E     F
id
11  2.0  3  3  5  100  9000
12  1.0  3  2  2  200  2400
13  3.0  1  1  4  190  2280
14  NaN  1  2  2  300  1200
Beware: id is the index here; use reset_index if you want it to be a normal column.
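For the second layout in the question, where every original row is kept, a groupby transform can broadcast the group sums back onto the rows. A sketch under the same column names, treating a missing A as 1:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [11, 11, 12, 13, 14],
    'A':  [2, 2, 1, 3, np.nan],
    'B':  [1, 2, 3, 1, 1],
    'C':  [2, 1, 2, 1, 2],
    'D':  [4, 1, 2, 4, 2],
    'E':  [100, 100, 200, 190, 300],
})

# per-id sums of B, C, D, broadcast back to every row of each group
sums = df.groupby('id')[['B', 'C', 'D']].transform('sum')
df['F'] = sums.prod(axis=1) * df['A'].fillna(1) * df['E']
print(df)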

Adding two pandas dataframe values only if the row and column value is the same

I have two dataframes of different sizes: one has more rows, but the second one has more columns.
I'm having trouble adding the values of one dataframe to the other when they share the same column and row value, which in this case is id.
This is some dummy data and how I was trying to solve it:
import pandas as pd
df1 = pd.DataFrame([(1, 2, 3), (3, 4, 5), (5, 6, 7), (7, 8, 9),
                    (100, 10, 12), (100, 10, 12), (100, 10, 12)],
                   columns=['id', 'value', 'c'])
df2 = pd.DataFrame([(1, 200, 3, 4, 6), (3, 400, 3, 4, 6),
                    (5, 600, 3, 4, 6), (5, 620, 3, 4, 6)],
                   columns=['id', 'value', 'x', 'y', 'z'])
So if the id of df1 and df2 is the same, then add to the value column the value in "whatToAdd".
data
df1:
id   value  c
1    2      3
3    4      5
5    6      7
7    8      9
100  10     12
100  10     12
100  10     12
df2:
id  value  x  y  z
1   200    3  4  6
3   400    3  4  6
5   600    3  4  6
5   620    3  4  6
expected:
Out:
id  value  x  y  z
1   202    3  4  6
3   404    3  4  6
5   606    3  4  6
5   626    3  4  6
tried:
for each in df1.a:
    if (df2.loc[df2['a'] == each]):
        df2['a'] += df['a']
This spews out the error "The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().", which is confusing for me because I tried
df2.loc[df2['a'] == 1]
out of the loop and it works.
Once you set both dataframes to have the same index:
df1 = df1.set_index("id")
df2 = df2.set_index("id")
You can do one very simple operation:
mask = df1.index.isin(df2.index)
df2["value"] += df1.loc[mask, "value"]
Output:
    value  x  y  z
id
1     202  3  4  6
3     404  3  4  6
5     606  3  4  6
5     626  3  4  6
You can always do df2.reset_index() to get back to the original layout.
You can use set_index with add, then follow with reindex:
(df1.set_index('id')
    .add(df2.set_index('id'), fill_value=0)
    .dropna(axis=0)
    .reset_index()
    .reindex(columns=df2.columns))
Out[193]:
   id  value    x    y    z
0   1  202.0  3.0  4.0  6.0
1   3  404.0  3.0  4.0  6.0
2   5  606.0  3.0  4.0  6.0
3   5  626.0  3.0  4.0  6.0
Here is the code I came up with. It uses a dict to look up the value for each id in df1. map can then be used to look up the value for each id in df2, creating a series that is then added to df2['value'] to produce the desired result.
df1_lookup = dict(df1.set_index('id')['value'].items())
df2['value'] += df2['id'].map(lambda x: df1_lookup.get(x, 0))
Here is a one-liner.
df2.loc[:, 'value'] += [df1.set_index('id').loc[i, 'value'] for i in df2.id]
print(df2)

   id  value  x  y  z
0   1    202  3  4  6
1   3    404  3  4  6
2   5    606  3  4  6
3   5    626  3  4  6
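One more variant, a sketch that makes the duplicate-id handling explicit: df1 repeats id 100, so the lookup must be deduplicated before mapping (a Series with a duplicated index cannot be used with map), and ids absent from df1 contribute 0.

import pandas as pd

df1 = pd.DataFrame([(1, 2, 3), (3, 4, 5), (5, 6, 7), (7, 8, 9),
                    (100, 10, 12), (100, 10, 12), (100, 10, 12)],
                   columns=['id', 'value', 'c'])
df2 = pd.DataFrame([(1, 200, 3, 4, 6), (3, 400, 3, 4, 6),
                    (5, 600, 3, 4, 6), (5, 620, 3, 4, 6)],
                   columns=['id', 'value', 'x', 'y', 'z'])

# one value per id, then map df2's ids onto it
lookup = df1.drop_duplicates('id').set_index('id')['value']
df2['value'] += df2['id'].map(lookup).fillna(0).astype(int)
print(df2)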

Pandas - remove row similar to other row

I need to remove rows from a pandas.DataFrame which satisfy an unusual condition.
If there is an exactly identical row, except that it has a NaN value in column "C", I want to remove that row.
Given the table:
A   B   C    D
1   2   NaN  3
1   2   50   3
10  20  NaN  30
5   6   7    8
I need to remove the first row, since it has NaN in column C but there is an otherwise identical row (the second) with a real value in column C.
However, the 3rd row must stay, because there are no rows with the same A, B and D values as it has.
How do you perform this using pandas? Thank you!
You can achieve this using drop_duplicates.
Initial DataFrame:
df = pd.DataFrame(columns=['a', 'b', 'c', 'd'],
                  data=[[1, 2, None, 3], [1, 2, 50, 3],
                        [10, 20, None, 30], [5, 6, 7, 8]])
df
    a   b    c   d
0   1   2  NaN   3
1   1   2   50   3
2  10  20  NaN  30
3   5   6    7   8
Then you can sort the DataFrame by column c. This pushes the NaNs to the bottom of the column:
df = df.sort_values(['c'])
df
    a   b    c   d
3   5   6    7   8
1   1   2   50   3
0   1   2  NaN   3
2  10  20  NaN  30
Then remove duplicates, taking into account all columns except c and keeping the first row encountered:
df1 = df.drop_duplicates(['a', 'b', 'd'], keep='first')
    a   b    c   d
3   5   6    7   8
1   1   2   50   3
2  10  20  NaN  30
But this is valid only if the NaNs are confined to column c.
You can try fillna along with drop_duplicates, using the filled frame only to decide which rows to keep:
idx = df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep='last').index
df.loc[idx]
Because the fill is temporary, this also handles the scenario where the A, B and D values are the same but both rows have non-NaN values in C, while the surviving rows keep their original C values.
You get:
    A   B    C   D
1   1   2   50   3
2  10  20  NaN  30
3   5   6    7   8
This feels right to me
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
    A   B     C   D
1   1   2  50.0   3
2  10  20   NaN  30
3   5   6   7.0   8
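For completeness, that last approach as a self-contained snippet, rebuilt from the question's table (column names as in the question):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 1, 10, 5],
                   'B': [2, 2, 20, 6],
                   'C': [np.nan, 50, np.nan, 7],
                   'D': [3, 3, 30, 8]})

# keep a row unless it belongs to a duplicate group (ignoring C)
# and its C value is missing
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df['C'].notnull()
print(df[notdups | notnans])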

Replace NaN values in a dataframe with values in a series python

I have a table in a pandas DataFrame df:
   id  count
0  10      3
1  20      4
2  30      5
3  40    NaN
4  50    NaN
5  60    NaN
6  70    NaN
I also have a pandas Series s:
0    1000
1    2000
2    3000
3    4000
What I want to do is replace the NaN values in my df with the respective values from series s.
My final output should be:
   id  count
0  10      3
1  20      4
2  30      5
3  40   1000
4  50   2000
5  60   3000
6  70   4000
Any ideas how to achieve this?
Thanks in advance.
There is a problem if the length of the Series differs from the number of NaN values in column count, so you need to reindex the Series by the number of NaNs:
s = pd.Series({0: 1000, 1: 2000, 2: 3000, 3: 4000, 5: 5000})
print(s)
0    1000
1    2000
2    3000
3    4000
5    5000
dtype: int64
import numpy as np

df.loc[df['count'].isnull(), 'count'] = \
    s.reindex(np.arange(df['count'].isnull().sum())).values
print(df)
   id   count
0  10     3.0
1  20     4.0
2  30     5.0
3  40  1000.0
4  50  2000.0
5  60  3000.0
6  70  4000.0
It's as simple as this:
df.loc[df['count'].isnull(), 'count'] = s.values
(Note that attribute access like df.count won't work here, because count is also a DataFrame method, so df.count returns the method rather than the column.)
In this case, I prefer iterrows for its readability.
counter = 0
for index, row in df.iterrows():
    if pd.isnull(row['count']):
        df.at[index, 'count'] = s[counter]
        counter += 1
I might add that this 'merging' of a dataframe and a series is a bit odd and prone to bizarre errors. If you can somehow get the series into the same format as the dataframe (i.e. give it matching index/column labels), you might be better served by the merge function.
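A minimal sketch of that idea, assuming the df and s from the question: relabel the series with the dataframe's row labels first, so ordinary index alignment does the work (no literal merge needed).

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [10, 20, 30, 40, 50, 60, 70],
                   'count': [3, 4, 5, np.nan, np.nan, np.nan, np.nan]})
s = pd.Series([1000, 2000, 3000, 4000])

# relabel s with the row labels where count is missing, then fill by alignment
fill = pd.Series(s.to_numpy(), index=df.index[df['count'].isna()], name='count')
df['count'] = df['count'].fillna(fill)
print(df)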
You can re-index your Series with the positions of the NaNs from the dataframe and then fillna() with your Series:
import numpy as np

s.index = np.where(df['count'].isnull())[0]
df['count'] = df['count'].fillna(s)
print(df)
   id   count
0  10     3.0
1  20     4.0
2  30     5.0
3  40  1000.0
4  50  2000.0
5  60  3000.0
6  70  4000.0
