I created two DataFrames and I want to subtract their values, omitting the first two columns of the first DataFrame.
df = pd.DataFrame({'sys':[23,24,27,30],'dis': [0.8, 0.8, 1.0,1.0], 'Country':['US', 'England', 'US', 'Germany'], 'Price':[500, 1000, 1500, 2000]})
print(df)
index = {'sys':[23,24,27,30]}
df2 = pd.DataFrame({ 'sys':[23,24,27,30],'dis': [0.8, 0.8, 1.0,1.0],'Price2':[300, 600, 4000, 1000], 'Price3': [2000, 1000, 600, 2000]})
df = df.set_index(['sys','dis', 'Country']).unstack().fillna(0)
df = df.reset_index()
df.columns.names =[None, None]
df.columns = df.columns.droplevel(0)
infile = pd.read_csv('benchmark_data.csv')
infile_slice = infile[(infile.System_num==26)]['Benchmark']
infile_slice = infile_slice.reset_index()
infile_slice = infile_slice.drop(infile_slice.index[4])
del infile_slice['index']
print(infile_slice)
dfNew = df.sub(infile_slice['Benchmark'].values, axis=0)
As written, I can only subtract the values from all columns at once. How can I skip the first two columns of df?
I've tried: dfNew = df.iloc[3:].sub(infile_slice['Benchmark'].values,axis=0), but it does not work.
DataFrames look like:
df:
England Germany US
0 23 0.8 0.0 0.0 500.0
1 24 0.8 1000.0 0.0 0.0
2 27 1.0 0.0 0.0 1500.0
3 30 1.0 0.0 2000.0 0.0
infile_slice:
Benchmark
0 3.3199
1 -4.0135
2 -4.9794
3 -3.1766
Maybe, this is what you are looking for?
>>> df
England Germany US
0 23 0.8 0.0 0.0 500.0
1 24 0.8 1000.0 0.0 0.0
2 27 1.0 0.0 0.0 1500.0
3 30 1.0 0.0 2000.0 0.0
>>> infile_slice
Benchmark
0 3.3199
1 -4.0135
2 -4.9794
3 -3.1766
>>> df.iloc[:, 4:] = df.iloc[:, 4:].sub(infile_slice['Benchmark'].values,axis=0)
>>> df
England Germany US
0 23 0.8 0.0 0.0 496.6801
1 24 0.8 1000.0 0.0 4.0135
2 27 1.0 0.0 0.0 1504.9794
3 30 1.0 0.0 2000.0 3.1766
>>>
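Since the session above depends on a local benchmark_data.csv, here is a self-contained sketch of the same positional-slicing idea with made-up stand-in frames (the column names and values are assumptions, not the asker's data):

```python
import pandas as pd

# Hypothetical stand-ins for df and infile_slice from the question
df = pd.DataFrame({'sys': [23, 24], 'dis': [0.8, 0.8],
                   'England': [0.0, 1000.0], 'US': [500.0, 0.0]})
bench = pd.Series([3.3199, -4.0135], name='Benchmark')

# Subtract row-wise from every column except the first two positional ones
df.iloc[:, 2:] = df.iloc[:, 2:].sub(bench.values, axis=0)
print(df)
```

The key point is that `iloc[:, 2:]` slices columns by position, while the asker's `iloc[3:]` sliced rows.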
You could use iloc as follows:
df_0_2 = df.iloc[:,0:2] # First 2 columns
df_2_end = df.iloc[:,2:].sub(infile_slice['Benchmark'].values, axis=0)
pd.concat([df_0_2, df_2_end], axis=1)
England Germany US
0 23 0.8 -3.3199 -3.3199 496.6801
1 24 0.8 1004.0135 4.0135 4.0135
2 27 1.0 4.9794 4.9794 1504.9794
3 30 1.0 3.1766 2003.1766 3.1766
Related
How do I print the number of rows dropped while executing the following code in python:
df.dropna(inplace = True)
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
len(new_data)
Use:
np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0,np.nan, 1], size=(10, 3)))
print (df)
0 1 2
0 NaN 0.0 NaN
1 0.0 NaN NaN
2 0.0 0.0 1.0
3 0.0 0.0 NaN
4 NaN NaN 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
7 NaN 0.0 1.0
8 1.0 1.0 NaN
9 1.0 0.0 NaN
You can count the missing values beforehand with DataFrame.isna combined with DataFrame.any and sum:
count = df.isna().any(axis=1).sum()
df.dropna(inplace = True)
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
Or take the difference in the DataFrame's length before and after dropna:
orig = df.shape[0]
df.dropna(inplace = True)
count = orig - df.shape[0]
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
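Both counting approaches can also be folded into one small helper; `drop_na_report` is a made-up name, shown as an illustrative sketch:

```python
import numpy as np
import pandas as pd

def drop_na_report(df):
    """Drop rows with any NaN and report how many rows were removed (illustrative helper)."""
    before = len(df)
    cleaned = df.dropna()
    return cleaned, before - len(cleaned)

np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0, np.nan, 1], size=(10, 3)))
cleaned, dropped = drop_na_report(df)
print(dropped)  # 7 with this seed
```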
I have a dataset with a number of numerical variables and a number of ordinal numeric variables. To fill missing values I want to use the mean for the numerical variables and the median for the ordinal ones. With the following code, each of them is created separately and is not collected in one DataFrame.
data = [['age', 'score'],
[10,1],
[20,""],
["",0],
[40,1],
[50,0],
["",3],
[70,1],
[80,""],
[90,0],
[100,1]]
df = pd.DataFrame(data[1:])
df.columns = data[0]
df = df[['age']].fillna(df.mean())
df = df[['score']].fillna(df.median())
pandas.DataFrame.fillna accepts a dict whose keys are column names, so you might do:
import pandas as pd
data = [['age', 'score'],
[10,1],
[20,None],
[None,0],
[40,1],
[50,0],
[None,3],
[70,1],
[80,None],
[90,0],
[100,1]]
df = pd.DataFrame(data[1:], columns=data[0])
df = df.fillna({'age':df['age'].mean(),'score':df['score'].median()})
print(df)
Output:
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
Keep in mind that an empty string is different from NaN; the latter can be created with Python's None.
First replace empty strings with missing values, then fill the missing values per column:
import numpy as np

df = df.replace('', np.nan)
df['age'] = df['age'].fillna(df['age'].mean())
df['score'] = df['score'].fillna(df['score'].median())
print (df)
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
You can also use DataFrame.agg to build a Series of aggregate values and pass it to DataFrame.fillna:
df = df.replace('', np.nan)
print (df.agg({'age':'mean', 'score':'median'}))
age 57.5
score 1.0
dtype: float64
df = df.fillna(df.agg({'age':'mean', 'score':'median'}))
print (df)
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
I have a DataFrame (df1) as given below
Hair Feathers Legs Type Count
R1 1 NaN 0 1 1
R2 1 0 NaN 1 32
R3 1 0 2 1 4
R4 1 NaN 4 1 27
I want to merge rows based on different combinations of the values in each column, and also add up the count values for each merged row. The resultant DataFrame (df2) will look like this:
Hair Feathers Legs Type Count
R1 1 0 0 1 33
R2 1 0 2 1 36
R3 1 0 4 1 59
The merging is performed in such a way that any NaN value will be merged with 0 or 1. In df2, R1 is calculated by merging the NaN value of Feathers (df1, R1) with the 0 value of Feathers (df1, R2). Similarly, the 0 value of Legs (df1, R1) is merged with the NaN value of Legs (df1, R2). Then the counts of R1 (1) and R2 (32) are added. In the same manner R2 and R3 are merged, because the Feathers value in R2 (df1) matches R3 (df1) and the NaN Legs value is merged with 2 in R3 (df1), and the counts of R2 (32) and R3 (4) are added.
I hope the explanation makes sense. Any help will be highly appreciated.
A possible way to do it is to replicate each row containing NaN and fill the copies with the possible values for the column.
First, we need to get the possible not-null unique values per column:
unique_values = df.iloc[:, :-1].apply(
lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}
Then iterate through each row of the dataframe and replace each NaN by the possible values for each column. We can do this using pandas.DataFrame.iterrows:
mask = df.iloc[:, :-1].isnull().any(axis=1)
# Keep the rows that do not contain NaN,
# then add the modified rows
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
for c in row[row.isnull()].index:
# For each column of the row, replace
# Nan by possible values for the column
for v in unique_values[c]:
list_of_df.append(row.copy().fillna({c:v}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
The result is a dataframe where all the NaN have been filled with possible values for the column:
> df_res
Hair Feathers Legs Type Count
0 1.0 0.0 2.0 1.0 4.0
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
3 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
To get the final result of Count grouping by possible combinations of ['Hair', 'Feathers', 'Legs', 'Type'] we just need to do:
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 33.0
1 1.0 0.0 2.0 1.0 36.0
2 1.0 0.0 4.0 1.0 59.0
Hope it helps.
UPDATE
If one or more elements of a row are missing, the procedure looks for all the possible combinations of the missing values at the same time. Let us add a new row with two elements missing:
> df
Hair Feathers Legs Type Count
0 1.0 NaN 0.0 1.0 1.0
1 1.0 0.0 NaN 1.0 32.0
2 1.0 0.0 2.0 1.0 4.0
3 1.0 NaN 4.0 1.0 27.0
4 1.0 NaN NaN 1.0 32.0
We will proceed in a similar way, but the replacement combinations will be obtained using itertools.product:
import itertools
unique_values = df.iloc[:, :-1].apply(
lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
mask = df.iloc[:, :-1].isnull().any(axis=1)
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
cols = row[row.isnull()].index.tolist()
for p in itertools.product(*[unique_values[c] for c in cols]):
list_of_df.append(row.copy().fillna({c:v for c, v in zip(cols, p)}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type']).reset_index(drop=True)
Hair Feathers Legs Type Count
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
6 1.0 0.0 0.0 1.0 32.0
0 1.0 0.0 2.0 1.0 4.0
3 1.0 0.0 2.0 1.0 32.0
7 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
8 1.0 0.0 4.0 1.0 32.0
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 65.0
1 1.0 0.0 2.0 1.0 68.0
2 1.0 0.0 4.0 1.0 91.0
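For reference, the whole expand-and-aggregate procedure can be collected into one self-contained function (`expand_nan_and_sum` is a made-up name); on the question's df1 it reproduces the expected df2:

```python
import itertools
import numpy as np
import pandas as pd

def expand_nan_and_sum(df, value_col='Count'):
    """Expand each NaN cell to all observed values of its column, then sum value_col (illustrative)."""
    feature_cols = [c for c in df.columns if c != value_col]
    unique_values = {c: df[c].dropna().unique().tolist() for c in feature_cols}
    rows = []
    for _, row in df.iterrows():
        missing = [c for c in feature_cols if pd.isna(row[c])]
        if not missing:
            rows.append(row)
            continue
        # One copy of the row per combination of possible fill values
        for combo in itertools.product(*[unique_values[c] for c in missing]):
            rows.append(row.fillna(dict(zip(missing, combo))))
    expanded = pd.DataFrame(rows).reset_index(drop=True)
    return expanded.groupby(feature_cols, as_index=False)[value_col].sum()

df = pd.DataFrame({'Hair': [1.0, 1.0, 1.0, 1.0],
                   'Feathers': [np.nan, 0.0, 0.0, np.nan],
                   'Legs': [0.0, np.nan, 2.0, 4.0],
                   'Type': [1.0, 1.0, 1.0, 1.0],
                   'Count': [1.0, 32.0, 4.0, 27.0]})
print(expand_nan_and_sum(df))
```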
I am attempting to multiply specific columns by a value in their respective row.
For example:
X Y Z
A 10 1 0 1
B 50 0 0 0
C 80 1 1 1
Would become:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
The problem I am having is that it is timing out when I use mul(). My real dataset is very large. I tried to iterate it with loop in my real code as follows:
for i in range(1, df_final_small.shape[0]):
    df_final_small.iloc[i].values[3:248] = df_final_small.iloc[i].values[3:248] * df_final_small.iloc[i].values[2]
Which when applied to the example dataframe would look like this:
for i in range(1, df_final_small.shape[0]):
    df_final_small.iloc[i].values[1:4] = df_final_small.iloc[i].values[1:4] * df_final_small.iloc[i].values[0]
There must be a better way to do this. I am having trouble figuring out how to apply the multiplication only to certain columns of the row rather than the entire row.
EDIT:
To detail further here is my df.head(5).
id gross 150413 Welcome Email 150413 Welcome Email Repeat Cust 151001 Welcome Email 151001 Welcome Email Repeat Cust 161116 eKomi 1702 Hot Leads Email 1702 Welcome Email - All Purchases 1804 Hot Leads ... SILVER GOLD PLATINUM Acquisition Direct Mail Conversion Direct Mail Retention Direct Mail Retention eMail cluster x y
0 0033333 46.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 10 -0.230876 0.461990
1 0033331 2359.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 9 0.231935 -0.648713
2 0033332 117.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 5 -0.812921 -0.139403
3 0033334 89.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 5 -0.812921 -0.139403
4 0033335 1908.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 7 -0.974142 0.145032
Just specify the columns you want to multiply. Example:
df=pd.DataFrame({'A':10,'X':1,'Y':1,'Z':1},index=[1])
df.loc[:,['X', 'Y', 'Z']]=df.loc[:,['X', 'Y', 'Z']].values*df.iloc[:,0:1].values
If you want to apply it to an arbitrary range of columns, use iloc:
range_of_columns = list(range(10, 5001)) + list(range(5030, 10001))
df.iloc[:, range_of_columns].values * df.iloc[:, 0:1].values  # multiply the range of columns by the first column
Use mul with axis=0, taking the multiplier from the index via get_level_values:
df.mul(df.index.get_level_values(1),axis=0)
Out[167]:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
Also, when the DataFrame is way too big, you can split it and do it by chunk:
dfs = np.split(df, [2], axis=0)
pd.concat([x.mul(x.index.get_level_values(1), axis=0) for x in dfs])
Out[174]:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
I would also recommend numpy broadcasting:
df.values*df.index.get_level_values(1)[:,None]
Out[177]:
array([[10,  0, 10],
       [ 0,  0,  0],
       [80, 80, 80]])
pd.DataFrame(df.values*df.index.get_level_values(1)[:,None],index=df.index,columns=df.columns)
Out[181]:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
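The snippets above assume the multiplier lives in level 1 of a MultiIndex. If it sits in an ordinary column instead (a hypothetical layout, not the asker's exact frame), the same row-wise broadcast can be sketched as:

```python
import pandas as pd

# Made-up frame: multiplier in a regular column rather than in the index
df = pd.DataFrame({'mult': [10, 50, 80],
                   'X': [1, 0, 1], 'Y': [0, 0, 1], 'Z': [1, 0, 1]},
                  index=list('ABC'))

# Broadcast the multiplier column across X, Y, Z only, leaving 'mult' untouched
cols = ['X', 'Y', 'Z']
df[cols] = df[cols].mul(df['mult'], axis=0)
print(df)
```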
I have two data frames:
df1 =
city.population city.sys.population cnt cod message tmp
0 0 38 200 0.1642 1
df2=
A B C D E tmp
0 0 38 200 0.1642 1
0 0 38 200 0.1642 1
0 0 38 200 0.1642 1
0 0 38 200 0.1642 1
I want to merge/join the two data frames on tmp and should get the result like
A B C D E tmp population cnt cod
0 0 38 200 0.1642 1 0 38 200
0 0 38 200 0.1642 1 0 38 200
0 0 38 200 0.1642 1 0 38 200
0 0 38 200 0.1642 1 0 38 200
But I'm getting values for population, cnt and cod only for the first record. Is there any way to have the values from the first record filled in for all rows of the population, cnt and cod fields?
You can join two data frames with append. Did you try it?
df1.append(df2)
df1.head()
Let me know if it works.
For more, look at the documentation:
http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html
df3 = pd.merge(df2, df1, on='tmp', how='outer') should give you what you want. This is equivalent to a full outer join in SQL, if you are familiar with that terminology.
What this does is smush the two data frames df1 and df2 together so that the df3.tmp column is equal to the union of the values in df1.tmp and df2.tmp (i.e. the list of values you would get from df1.tmp + df2.tmp). So any rows in df2 that have df2.tmp == 1 will get the info from df1 where df1.tmp == 1. This works if you want to include all the data from both df1 and df2: if df1 has a value in 'tmp' that is NOT in df2, your new dataframe will have NaN values for columns A, B, C, D, E alongside the data from df1, so you don't lose any data in the merge.
eg if df1 =
pop syspop ct cod msg tmp
0 0.0 0.0 30.0 200.0 0.1642 1.0
1 0.0 0.0 0.0 0.0 0.0000 3.0
then df3=
a b c d e tmp pop syspop ct cod msg
0 0.0 0.0 38.0 200.0 0.1642 1.0 0.0 0.0 30.0 200.0 0.1642
1 0.0 0.0 38.0 200.0 0.1642 1.0 0.0 0.0 30.0 200.0 0.1642
2 0.0 0.0 38.0 200.0 0.1642 1.0 0.0 0.0 30.0 200.0 0.1642
3 0.0 0.0 38.0 200.0 0.1642 1.0 0.0 0.0 30.0 200.0 0.1642
4 0.0 0.0 0.0 0.0 0.0000 2.0 NaN NaN NaN NaN NaN
If you want other combinations of df1 and df2 (for example, if you don't care about values in df1 that are not also in df2), change the 'how' argument. See the pandas docs for more info:
http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra
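A minimal runnable sketch of the how argument, using tiny stand-in frames rather than the asker's data:

```python
import pandas as pd

# Made-up miniature versions of df1 and df2
df1 = pd.DataFrame({'tmp': [1, 3], 'population': [0, 5]})
df2 = pd.DataFrame({'A': [0, 0, 0, 0], 'tmp': [1, 1, 1, 1]})

outer = pd.merge(df2, df1, on='tmp', how='outer')  # keeps tmp == 3 with NaN in A
inner = pd.merge(df2, df1, on='tmp', how='inner')  # drops keys missing from either side
print(outer)
print(inner)
```

The outer result has one extra row (for tmp == 3, which only exists in df1), with NaN in df2's columns; the inner result keeps only keys present in both frames.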