I have a data frame with 6 columns; the first two columns should be plotted as x and y. I want to replace the values of the 6th column with other values and then exclude the x, y points whose 6th-column values fall outside a threshold range like 0.0003-0.002. My attempt is below:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('configuration_1000.out', sep=r"\s+", header=None)
#print(df)
col_5 = df.iloc[:,5]
g = col_5.abs()
g = g*0.00005
#print(g)
df.loc[:,5].replace(g, inplace=True)
#df.head()
selected = df[ (df.loc[:,5] > 0.0003) & (df.loc[:,5] < 0.002) ]
print(selected)
plt.plot(selected[0], selected[1],marker=".")
But when I do this, nothing changes.
You don't need iloc for this, nor do you need to go through the intermediate steps. Just manipulate the column directly.
df[df.columns[5]] = abs(df[df.columns[5]])*0.00005
To solve the problem you just need to assign the transformed values back to the column:
df.loc[:,5] = g
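Putting the pieces together, a minimal sketch of the corrected script (file name and threshold values taken from the question):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('configuration_1000.out', sep=r"\s+", header=None)

# Replace column 5 with its scaled absolute values.
df[5] = df[5].abs() * 0.00005

# Keep only the rows whose column-5 value lies inside the threshold range.
selected = df[(df[5] > 0.0003) & (df[5] < 0.002)]

plt.plot(selected[0], selected[1], marker=".")
plt.show()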
I have two datasets. Below you can see the code and data.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
data = {'type_sale': ['group_1', 'group_2', 'group_3', 'group_4', 'group_5',
                      'group_6', 'group_7', 'group_8', 'group_9', 'group_10'],
        'id': [70, 20, 24, 80, 20, 20, 60, 20, 20, 20]}
df1 = pd.DataFrame(data, columns=['type_sale', 'id'])

data = {'type_sale': ['group_1', 'group_2', 'group_3'],
        'id': [70, 20, 24]}
df2 = pd.DataFrame(data, columns=['type_sale', 'id'])
This code creates the two data frames df1 and df2.
Now I want to create a new data frame df3 containing the rows of df1 whose values in the id column do not appear in df2 (distinct values). The final result should contain only the group_4 and group_7 rows.
I tried the code below, but it does not give the desired result.
df = pd.concat((df1, df2))
print(df.drop_duplicates('id'))
Can anybody help me solve this problem?
Try as follows:
Use df1['id'].isin(df2['id']) to check, for each value in df1['id'], whether it is contained in df2['id'].
Next, invert the resulting boolean pd.Series with the unary operator ~ (tilde) and use it to select from df1.
Finally, reset the index.
In a one-liner:
df3 = df1[~df1['id'].isin(df2['id'])].reset_index(drop=True)
print(df3)
type_sale id
0 group_4 80
1 group_7 60
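For illustration, a sketch of the intermediate steps described above (same df1 and df2 as in the question):
mask = df1['id'].isin(df2['id'])          # True where the id also appears in df2
df3 = df1[~mask].reset_index(drop=True)   # invert the mask, select from df1, reset the index
print(df3)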
import polars as pl
import pandas as pd
A = ['a','a','a','a','a','a','a','b','b','b','b','b','b','b']
B = [1,2,3,4,5,6,7,8,9,10,11,12,13,14]
df = pl.DataFrame({'cola':A,
'colb':B})
df_pd = df.to_pandas()
index = df_pd.groupby('cola')['colb'].idxmax()
df_pd.loc[index,'top'] = 1
In pandas I can build the top column using idxmax(), as shown above.
However, in Polars I tried arg_max():
index = df[pl.col('colb').arg_max().over('cola').flatten()]
but this does not give me what I want.
Is there any way to generate a 'top' column in Polars?
Thanks a lot!
In Polars, window functions (the .over()) will do an aggregation + self-join (see https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.over.html?highlight=over#polars.Expr.over), which means you cannot return a unique value per row, which is what you are after.
A way to compute the top column is to use apply:
df.groupby("cola").apply(lambda x: x.with_columns([pl.col("colb"), (pl.col("colb")==pl.col("colb").max()).alias("top")]))
For example, let's take the penguins dataset, where I want to drop all rows whose bill_length_mm value is more than 30:
import seaborn as sns
import pandas as pd
ds = sns.load_dataset("penguins")
ds.head()
ds.drop(ds[ds['bill_length_mm']>30])
This gives me an error. And if I add axis=1, it just drops every column in the dataset:
ds.drop(ds[ds['bill_length_mm']>30], axis=1)
So what should I do to accomplish my goal?
Try
ds=ds.drop(ds[ds['bill_length_mm']>30].index)
Or
ds = ds[ds['bill_length_mm']<=30]
ds.drop expects row index (or column) labels, not a filtered DataFrame, which is why passing the boolean-selected frame fails. If you only want to keep the rows where bill_length_mm <= 30, you can use
ds = ds[ds['bill_length_mm']<=30]
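Note that the two approaches above differ for rows where bill_length_mm is NaN (which the penguins dataset does contain). A small sketch to illustrate:
import seaborn as sns

ds = sns.load_dataset("penguins")
# Dropping by index keeps NaN rows, because NaN > 30 is False and those rows are never selected.
kept_by_drop = ds.drop(ds[ds['bill_length_mm'] > 30].index)
# Boolean selection excludes NaN rows as well, because NaN <= 30 is also False.
kept_by_mask = ds[ds['bill_length_mm'] <= 30]
print(len(kept_by_drop), len(kept_by_mask))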
Using Python 3 I wrote code for calculating data. The code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def data(symbols):
    dates = pd.date_range('2016/01/01', '2016/12/23')
    df = pd.DataFrame(index=dates)
    for symbol in symbols:
        df_temp = pd.read_csv("/home/furqan/Desktop/Data/{}.csv".format(symbol),
                              index_col='Date', parse_dates=True,
                              usecols=['Date', 'Close'], na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    df = df / df.iloc[0, :]
    return df
symbols = ['FABL','HINOON']
df=data(symbols)
print(df)
p_value=(np.zeros((2,2),dtype="float"))
p_value[0,0]=0.5
p_value[1,1]=0.5
print(df.shape[1])
print(p_value.shape[0])
df=np.dot(df,p_value)
print(df.shape[1])
print(df.shape[0])
print(df)
When I print df the second time, the index has vanished. I think the issue is due to the matrix multiplication. How can I get the index and column headings back into df?
The issue is that numpy methods typically return a plain numpy array, which is why any existing column and index labels are lost.
So instead of
df=np.dot(df,p_value)
you can do
df=df.dot(p_value)
Additionally, because p_value is a plain numpy array it carries no labels, so you can either build it as a DataFrame whose index and columns are the existing column names (the index has to match df's columns for the dot product to align):
p_value = pd.DataFrame(np.zeros((2,2), dtype="float"), index=df.columns, columns=df.columns)
or just overwrite the column names directly after calculating the dot product like so:
df.columns = ['FABL', 'HINOON']
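Putting it together, a minimal sketch (symbol names as in the question; df as returned by the data() function above):
p_value = pd.DataFrame(np.zeros((2, 2)), index=df.columns, columns=df.columns)
p_value.iloc[0, 0] = 0.5
p_value.iloc[1, 1] = 0.5
df = df.dot(p_value)   # index and column labels are preserved
print(df.head())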
I have a pandas data frame with several missing values. I noticed that the non-missing values are close to each other. Thus, I would like to impute the missing values by randomly choosing from the non-missing values.
For instance:
import pandas as pd
import random
import numpy as np
foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B':[np.nan, 4, 2, np.nan, 5]})
foo
A B
0 2 NaN
1 3 4
2 NaN 2
3 5 NaN
4 NaN 5
I would like, for instance, foo['A'][2] = 2 and foo['A'][4] = 3.
The shape of my pandas DataFrame is (6940,154).
I tried this:
foo['A'] = foo['A'].fillna(random.choice(foo['A'].values.tolist()))
But it is not working. Could you help me achieve that? Best regards.
You can use the pandas fillna method together with a random selection to fill the missing values of a column with random non-NaN values from that same column. Passing a Series of random draws to fillna means each missing position gets its own value:
import random
import pandas as pd

fill_values = pd.Series(random.choices(df["column"].dropna().tolist(), k=len(df)),
                        index=df.index)
df["column"] = df["column"].fillna(fill_values)
Where "column" is the column you want to fill with random non-NaN values.
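Applied to the foo frame from the question, a usage sketch of the same idea (column 'A' chosen for illustration):
import random
import numpy as np
import pandas as pd

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B': [np.nan, 4, 2, np.nan, 5]})
fill_values = pd.Series(random.choices(foo['A'].dropna().tolist(), k=len(foo)), index=foo.index)
foo['A'] = foo['A'].fillna(fill_values)
print(foo)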
This works well for me on a pandas DataFrame:
import random

def randomiseMissingData(df2):
    "randomise missing data for DataFrame (within a column)"
    df = df2.copy()
    for col in df.columns:
        data = df[col]
        mask = data.isnull()
        # sample (with replacement) from the non-missing values of this column
        samples = random.choices(data[~mask].values, k=mask.sum())
        df.loc[mask, col] = samples
    return df
I did this for filling NaN values with a random non-NaN value:
import random
df['column'].fillna(random.choice(df['column'][df['column'].notna()]), inplace=True)
This is another approach to this question, improving on the first answer and following the numpy documentation on how to check whether a value is NaN:
foo['A'] = foo['A'].apply(
    lambda x: np.random.choice(range(int(foo['A'].min()), int(foo['A'].max()) + 1))
    if np.isnan(x) else x)
Here is another Pandas DataFrame approach
import numpy as np
def fill_with_random(df2, column):
    '''Fill `df2`'s column with name `column` with random data based on non-NaN data from `column`'''
    df = df2.copy()
    df[column] = df[column].apply(
        lambda x: np.random.choice(df[column].dropna().values) if np.isnan(x) else x)
    return df
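A quick usage sketch on the foo frame from the question:
foo = fill_with_random(foo, 'A')
print(foo)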
For me only the following worked; all the examples above failed: some filled every NaN with the same number, and some did not fill anything at all.
def fill_sample(df, col):
    # sample as many non-NaN values (without replacement) as there are NaN rows
    tmp = df[df[col].notna()][col].sample(len(df[df[col].isna()])).values
    k = 0
    for i, row in df[df[col].isna()].iterrows():
        df.at[i, col] = tmp[k]
        k += 1
    return df
Not the most concise, but probably the most performant way to go:
nans = df[col].isna()
non_nans = df.loc[df[col].notna(), col]
samples = np.random.choice(non_nans, size=nans.sum())
df.loc[nans, col] = samples
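For completeness, a small sketch wrapping the vectorised approach above into a helper that fills every column containing NaNs (the helper name is illustrative only):
import numpy as np

def fill_random_all(df):
    # Fill NaNs in each column with values sampled from that column's non-NaN entries.
    df = df.copy()
    for col in df.columns:
        nans = df[col].isna()
        if nans.any():
            non_nans = df.loc[~nans, col]
            df.loc[nans, col] = np.random.choice(non_nans, size=nans.sum())
    return df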