I have a table named df with two columns, Name and Data. The table looks something like the following:
I am trying to create all possible combinations of values from the Data column and concatenate the results as separate columns to the existing table. Basically, in every subsequent column two of the names will take the values 2 and 1.5, and the rest will take the value 1. I am looking for output similar to the following table:
I have been able to figure out which combinations of names will take the values 2 and 1.5 in the next column, using the following code:
for index in list(combinations(df[['Name']].index, 2)):
    print(df[['Name']].loc[index, :])
    print('\n')
However, I am stuck on how to create the fresh columns as mentioned above. Any help on the same is highly appreciated.
I think you are looking for permutations, not combinations. In this case we can generate those and transpose the data. After the transpose we can rename the columns.
import pandas as pd
from itertools import permutations
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Data': [1, 2, 1, 1.5]})
df = pd.DataFrame(list(permutations(df.Data.values,4)), columns=df.Name.values).T
df.columns = [f'Data{x+1}' for x in df.columns]
df.reset_index(inplace=True)
df.rename(columns={'index':'Name'}, inplace=True)
Or:
pd.DataFrame(list(permutations(df.Data.values,4)), columns=df.Name.values).T.add_prefix('Data').rename_axis('Name').reset_index()
Output
Name Data1 Data2 Data3 Data4 Data5 Data6 Data7 Data8 Data9 ... \
0 A 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 ...
1 B 2.0 2.0 1.0 1.0 1.5 1.5 1.0 1.0 1.0 ...
2 C 1.0 1.5 2.0 1.5 2.0 1.0 1.0 1.5 1.0 ...
3 D 1.5 1.0 1.5 2.0 1.0 2.0 1.5 1.0 1.5 ...
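As a quick sanity check, the one-liner can be run end-to-end on the sample data (a sketch; note that since Data contains a duplicate value, some of the 4! = 24 permutation columns repeat):

```python
import pandas as pd
from itertools import permutations

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Data': [1, 2, 1, 1.5]})

# build all 24 permutations of the Data column, transpose so each
# permutation becomes a column, and turn the Name axis into a column
out = (pd.DataFrame(list(permutations(df.Data.values, 4)),
                    columns=df.Name.values)
         .T.add_prefix('Data')
         .rename_axis('Name')
         .reset_index())
```

Every Data column is a rearrangement of the same four values, so each column sums to 5.5; if the duplicate columns are unwanted, a `T.drop_duplicates().T`-style dedup step could be added.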
I have the following dataframe:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
I want to do a ffill() on column B with df["B"].ffill(inplace=True) which results in the following df:
A B C D
0 NaN 2.0 NaN 0.0
1 3.0 4.0 NaN 1.0
2 NaN 4.0 5.0 NaN
3 NaN 3.0 NaN 4.0
Now I want to replace all NaN values with their corresponding value from column B. The documentation states that you can give fillna() a Series, so I tried df.fillna(df["B"], inplace=True). This results in the exact same dataframe as above.
However, if I put in a simple value (e.g. df.fillna(0, inplace=True), then it does work:
A B C D
0 0.0 2.0 0.0 0.0
1 3.0 4.0 0.0 1.0
2 0.0 4.0 5.0 0.0
3 0.0 3.0 0.0 4.0
The funny thing is that fillna() does seem to work with a Series as the value parameter when operated on another Series object. For example, df["A"].fillna(df["B"], inplace=True) results in:
     A    B    C    D
0  2.0  2.0  NaN  0.0
1  3.0  4.0  NaN  1.0
2  4.0  4.0  5.0  NaN
3  3.0  3.0  NaN  4.0
My real dataframe has a lot of columns and I would hate to manually fillna() all of them. Am I overlooking something here? Didn't I understand the docs correctly perhaps?
EDIT I have clarified my example in such a way that 'ffill' with axis=1 does not work for me. In reality, my dataframe has many, many columns (hundreds) and I am looking for a way to not have to explicitly mention all the columns.
Try changing the axis to 1 (columns):
df = df.ffill(axis=1).bfill(axis=1)
If you need to specify the columns, you can do something like this:
df[["B","C"]] = df[["B","C"]].ffill(1)
EDIT:
Since you need something more general and df.fillna(df.B, axis = 1) is not implemented yet, you can try with:
df = df.T.fillna(df.B).T
Or, equivalently:
df.T.fillna(df.B, inplace=True)
This works because the index of df.B coincides with the columns of df.T, so pandas knows which value to use for each column. From the docs:
value: scalar, dict, Series, or DataFrame.
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
So, for example, the NaN in column 0 at row A (in df.T) will be replaced for the value with index 0 in df.B.
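A minimal sketch of the transpose trick on the question's data (after the ffill on column B):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df['B'] = df['B'].ffill()  # B becomes [2, 4, 4, 3]

# transpose so the rows become columns; fillna with a Series then matches
# df.B's index (0..3) against the transposed frame's columns (0..3)
filled = df.T.fillna(df['B']).T
```

Each NaN ends up replaced by the B value from its own row, across every column at once.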
How do I combine values from two rows that have an identical index and no overlap in their non-null values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
You can stack() (which drops all the NaNs) and then unstack():
df.stack().unstack()
If a simpler solution is acceptable, take the first non-missing value per index label with GroupBy.first:
df1 = df.groupby(level=0).first()
If summing per label gives the same output for your sample data, use sum:
df1 = df.groupby(level=0).sum()
If there can be multiple non-missing values per group, you will need to specify the expected output; that case is obviously more complicated.
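Run on the sample data, the GroupBy.first approach looks like this (a sketch):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, 5, 6]],
                  index=['a', 'b', 'b'])

# first non-missing value in each column, per index label
out = df.groupby(level=0).first()
```

The two 'b' rows collapse into one, each cell taking the single non-null value available.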
I have a dataframe with some information on it. I created another dataframe that is larger and has default values in it. I want to update the default dataframe with the values from the first dataframe. I'm using df.update but nothing is happening. Here is the code:
new_df = pd.DataFrame(index=range(25))
new_df['Column1'] = 1
new_df['Column2'] = 2
new_df.update(old_df)
Here, old_df has 2 rows, indexed 5,6 with some random values in Column1 and Column2 and nothing else. I'm expecting these rows to overwrite the default values in new_df, what am I doing wrong?
This works for me, so I assume the problem is in the part of the code you haven't shown us.
import pandas as pd
import numpy as np
new_df = pd.DataFrame(index=range(25))
old_df = pd.DataFrame(index=[5,6])
new_df['Column1'] = 1
new_df['Column2'] = 2
old_df['Column1'] = np.nan
old_df['Column2'] = np.nan
old_df.loc[5,'Column1'] = 9
old_df.loc[6,'Column2'] = 7
new_df.update(old_df)
print(new_df.head(10))
Output:
Column1 Column2
0 1.0 2.0
1 1.0 2.0
2 1.0 2.0
3 1.0 2.0
4 1.0 2.0
5 9.0 2.0
6 1.0 7.0
7 1.0 2.0
8 1.0 2.0
9 1.0 2.0
Since you don't show how old_df is constructed, before doing the update make sure that both indexes have the same dtype:
new_df.index = new_df.index.astype('int64')
old_df.index = old_df.index.astype('int64')
An int is not equal to a string (1 != '1'), so update() finds no common rows between your dataframes and has nothing to do.
I have a dataframe column which contains a list of numbers from a .csv. These numbers range from 1-1400, may or may not be repeated, and a NaN value can appear pretty much anywhere at random.
Two examples would be
a=[1,4,NaN,5,6,7,...1398,1400,1,2,3,NaN,8,9,...,1398,NaN]
b=[1,NaN,2,3,4,NaN,7,10,...,1398,1399,1400]
I would like to create another column that finds the first 1-1400 run and records a '1' at the same index; if a second 1-1400 run exists, mark it as '2' in the new column.
I can think of some roundabout ways using temporary placeholders and some other kind of checks, but I was wondering if there was a 1-3 liner to do this operation
Edit1: I would prefer there to be a single column returned
a1=[1,1,NaN,1,1,1,...1,1,2,2,2,NaN,2,2,...,2,NaN]
b1=[1,NaN,1,1,1,NaN,1,1,...,1,1,1]
You can use groupby() and cumcount() to count numbers in each column:
# create new columns for counting
df['a1'] = np.nan
df['b1'] = np.nan
# take groupby for each value in column `a` and `b` and count each value
df.a1 = df.groupby('a').cumcount() + 1
df.b1 = df.groupby('b').cumcount() + 1
# keep NaN rows as NaN
df.loc[df.a.isnull(), 'a1'] = np.nan
df.loc[df.b.isnull(), 'b1'] = np.nan
EDIT (after receiving a comment of 'does not work'):
df['a2'] = df.ffill().a.diff()
df['a1'] = df.loc[df.a2 < 0].groupby('a').cumcount() + 1
df['a1'] = df['a1'].bfill().shift(-1)
df.loc[df.a1.isnull(), 'a1'] = df.a1.max() + 1
df.drop('a2', axis=1, inplace=True)
df.loc[df.a.isnull(), 'a1'] = np.nan
You can use diff() to check when the difference between two consecutive values is negative, which marks the start of a new range. Let's create a dataframe:
import pandas as pd
import numpy as np
# create a dataframe with two columns; the range here goes up to 12, but 1400 works the same
df = pd.DataFrame({'a':[1,4,np.nan,5,10,12,2,3,4,np.nan,8,12],'b':range(1,13)})
df.loc[[4,8],'b'] = np.nan
Because you have NaNs, you need ffill to fill each NaN with the previous value, and then you want the opposite (using ~) of the rows where the diff is greater than or equal to 0. (That sounds the same as "less than 0", but it is not quite, because the comparison is False for the first row, where diff is NaN.) For column 'a', for example:
print (df.loc[~(df.a.ffill().diff()>=0),'a'])
0 1.0
6 2.0
Name: a, dtype: float64
you get the two rows where a "new" range starts. To use this property to create 'a1', you can do:
# put 1 in the rows with a new range start
df.loc[~(df.a.ffill().diff()>=0),'a1'] = 1
# create a mask to select notnull row in a:
mask_a = df.a.notnull()
# use cumsum and ffill on column a1 with the mask_a
df.loc[mask_a,'a1'] = df.loc[mask_a,'a1'].cumsum().ffill()
Finally, for several columns, you can do:
list_col = ['a','b']
for col in list_col:
    df.loc[~(df[col].ffill().diff() >= 0), col+'1'] = 1
    mask = df[col].notnull()
    df.loc[mask, col+'1'] = df.loc[mask, col+'1'].cumsum().ffill()
and with my input, you get:
a b a1 b1
0 1.0 1.0 1.0 1.0
1 4.0 2.0 1.0 1.0
2 NaN 3.0 NaN 1.0
3 5.0 4.0 1.0 1.0
4 10.0 NaN 1.0 NaN
5 12.0 6.0 1.0 1.0
6 1.0 7.0 2.0 1.0
7 3.0 8.0 2.0 1.0
8 4.0 NaN 2.0 NaN
9 NaN 10.0 NaN 1.0
10 8.0 11.0 2.0 1.0
11 12.0 12.0 2.0 1.0
EDIT: you can even do it in one line for each column, same result:
df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
df['b1'] = df[df.b.notnull()].b.diff().fillna(-1).lt(0).cumsum()
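To check the one-liner against the sample column a above (a sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 4, np.nan, 5, 10, 12, 2, 3, 4, np.nan, 8, 12]})

# among the non-null values, a negative diff marks the start of a new range;
# fillna(-1) makes the very first value start range 1, and cumsum numbers
# the ranges 1, 2, ...; NaN rows stay NaN after index alignment
df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
```

The subsetting on notnull() means the diff is taken between consecutive non-null values, so NaNs inside a run do not split it.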
There are a few questions here on this topic, but none seem to be helpful in my case. Here's a dumbed down version of what I want:
This is the csv file of interest: http://pastebin.com/rP7tPDse
I'm creating the pivot table as:
piv = pd.read_csv("test.csv",delimiter = "\s+").pivot_table('z','x','y')
And this returns
y 0.0 1.0 1.3 2.0
x
0.0 1.0 5.0 NaN 4.0
1.0 3.0 4.0 NaN 6.0
1.5 NaN NaN 7.0 NaN
2.0 3.0 5.0 NaN 7.0
I would like to find a slice of this array as a pivot_table, such as:
y 1.3 2.0
x
0.0 NaN 4.0
1.0 NaN 6.0
Based on the x and y values. I want to include the NaN's as well, to do processing on them later. Help much appreciated.
EDIT: updating the question to be more specific.
I'm looking to extract a pivot table that has values denoted by the column 'z' and indexed by 'x' and 'y', with the condition that:
All x values between arbitrary xmin and xmax
All y values between arbitrary ymin and ymax
From piv, as defined above, I want to do something like:
piv.loc[(piv.y <= 2.0) &
        (piv.y >= 1.3) &
        (piv.x >= 0.0) &
        (piv.x <= 1.2)]
And this would yield me the example answer, above.
Also, in the actual dataset, which I did not post here, there are many more columns. 'x', 'y' and 'z' are just some of them.
When I copied the dataframe, the columns came out as strings and the rows as floats.
To get the columns as floats:
df.columns = df.columns.astype(float)
Now you can use pd.IndexSlice:
df.loc[pd.IndexSlice[0:1], pd.IndexSlice[1.3:2]]
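Put together on a pivot reconstructed from the values shown in the question (a sketch; the frame here stands in for the one built from test.csv):

```python
import numpy as np
import pandas as pd

# reconstruct the pivot table displayed in the question
piv = pd.DataFrame([[1, 5, np.nan, 4],
                    [3, 4, np.nan, 6],
                    [np.nan, np.nan, 7, np.nan],
                    [3, 5, np.nan, 7]],
                   index=pd.Index([0.0, 1.0, 1.5, 2.0], name='x'),
                   columns=pd.Index([0.0, 1.0, 1.3, 2.0], name='y'))

# label-based rectangle: 0.0 <= x <= 1.2 and 1.3 <= y <= 2.0,
# NaNs included since this is pure label slicing, not filtering
sub = piv.loc[pd.IndexSlice[0.0:1.2], pd.IndexSlice[1.3:2.0]]
```

Because .loc slices by label (both endpoints inclusive), the NaN cells survive, which is what the question asked for; this requires the float index and columns to be sorted.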