Creating a dataframe in Python

Not sure how to correctly phrase this, but here goes:
What's the easiest way to create a one-column dataframe in Python that holds ones and zeros, with the length determined by some input?
For example, say I have a sample size of 1000, of which 100 are successes (ones). The number of zeros would then be the sample size (i.e., 1000) minus the successes. So the output would be a df of length 1000, of which 100 rows contain a one and 900 contain a zero.

From what you describe, a simple list would do the trick. Otherwise, you can use numpy.array or pandas.DataFrame/pandas.Series (more table-like).
import numpy as np
import pandas as pd
input_length = 1000
# List approach:
my_list = [0 for i in range(input_length)]
# Numpy array:
my_array = np.zeros(input_length)
# With Pandas:
my_table = pd.Series(0, index=range(input_length))
All these create a vector of zeroes, then you assign the successes (ones) as you please. If these were to follow some known distribution, numpy also has methods to generate random vectors that follow them (see here).
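For instance, a minimal sketch that marks the first 100 entries of the NumPy array above as successes (the 100/1000 split is taken from the question) and then shuffles their positions:
n_success = 100
my_array[:n_success] = 1
np.random.shuffle(my_array)  # optional: spread the ones randomly through the vector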
If you're really looking for the pandas approach, it can also be combined with the previous ones. That is, you can assign a list or numpy.array to the values of your Series/DataFrame. For example, imagine you want to draw 1000 random samples from a binomial distribution with p=0.5:
p=0.5
my_data = pd.Series(np.random.binomial(1, p, input_length))
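For instance, to check the resulting counts of ones and zeros:
print(my_data.sum())           # number of ones, roughly 500 on average for p=0.5
print(my_data.value_counts())  # counts of zeros and ones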

In addition to N.P.'s answer, you could do something like this:
import pandas as pd
import numpy as np
def generate_df(df_len):
    values = np.random.binomial(n=1, p=0.1, size=df_len)
    return pd.DataFrame({'value': values})
df = generate_df(1000)
edit:
More complete function:
def generate_df(df_len, option, p_success=0.1):
    '''
    Generate a pandas DataFrame with one single field filled with
    1s and 0s in p_success proportion and length df_len.
    Input:
        - df_len: int, length of the 1st dimension of the DataFrame
        - option: string, determines how the sample will be generated
            * random: according to a Bernoulli distribution with p=p_success
            * fixed: failures first, and a fixed proportion of successes p_success
            * fixed_shuffled: fixed proportion of successes p_success, random order
        - p_success: proportion of successes among the total
    Output:
        - df: pandas DataFrame
    '''
    if option == 'random':
        values = np.random.binomial(n=1, p=p_success, size=df_len)
    elif option in ('fixed_shuffled', 'fixed'):
        n_success = int(df_len * p_success)
        n_fail = df_len - n_success
        values = [0] * n_fail + [1] * n_success
        if option == 'fixed_shuffled':
            np.random.shuffle(values)
    else:
        raise Exception('Unknown option: {}'.format(option))
    df = pd.DataFrame({'value': values})
    return df
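A quick usage sketch (assuming the imports and the function above):
df_fixed = generate_df(1000, option='fixed_shuffled', p_success=0.1)
df_random = generate_df(1000, option='random', p_success=0.1)
print(df_fixed['value'].sum())   # exactly 100 ones
print(df_random['value'].sum())  # roughly 100 ones on average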

Related

How do I save an N x M array/list using Pandas?

I have an N x M numpy array / list. I want to save this matrix into a .csv file using Pandas. Unfortunately I don't know the values of M and N a priori, and they can be large. I am interested in Pandas because I find it manageable in terms of access to data columns.
Let's start with this MWE:
import numpy as np
import pandas as pd
N,M = np.random.randint(10,100, size = 2)
A = np.random.randint(10, size = (N,M))
columns = []
for i in range(len(A[0,:])):
    columns.append("column_{}".format(i))
I cannot do something like pd.append(), i.e., append columns with new additional indices via a for loop.
Is there a way to save A into a .csv file?
Following the comment of Quang Hoang, there are 2 possibilities:
pd.DataFrame(A).to_csv('yourfile.csv').
np.save("yourfile.npy",A) and then A = np.load("yourfile.npy").
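If you want to keep the column names generated in the MWE, a small variation of the first option (index=False is just an assumption that the row index is not needed in the file):
df = pd.DataFrame(A, columns=columns)
df.to_csv('yourfile.csv', index=False)
df2 = pd.read_csv('yourfile.csv')  # read it back later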

Taking first value in a rolling window that is not numeric

This question follows one I previously asked here, which was answered for numeric values.
I now raise this second one for data of Period type.
While the example given below appears simple, my actual windows are of variable size. Since I am interested in the first row of each window, I am looking for a technique that makes use of this definition.
import pandas as pd
from random import seed, randint
# DataFrame
pi1h = pd.period_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in pi1h]
df = pd.DataFrame({'Values' : values, 'Period' : pi1h}, index=pi1h)
# This works (numeric type)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows[0])
# This doesn't (Period type)
df['OpeningPeriod'] = df['Period'].rolling(3).agg(lambda rows: rows[0])
Result of 2nd command
DataError: No numeric types to aggregate
Please, any idea? Thanks for any help! Bests,
The first row of a rolling window of size 3 is the row 2 rows above the current one - just use pd.Series.shift(2):
df['OpeningPeriod'] = df['Period'].shift(2)
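A quick sanity check on the numeric column from the question (both series carry NaN in the first two rows):
print(df['first'].equals(df['Values'].shift(2)))  # True: shift(2) reproduces the rolling first value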
For the variable window size (for the sake of example, I took the Values column as the window size):
import numpy as np
# x[i] = i - Values[i]: the position of the row that opens row i's window
x = (np.arange(len(df)) - df['Values'])
# where that position is valid (>= 0), take the Period there; otherwise NaN
df['OpeningPeriod'] = np.where(x.ge(0), df.loc[df.index[x.tolist()], 'Period'], np.nan)
Convert your period[H] to a float
# convert to float
df['Period1'] = df['Period'].dt.to_timestamp().values.astype(float)
# rolling and convert back to period
df['OpeningPeriod'] = pd.to_datetime(df['Period1'].rolling(3)
                                     .agg(lambda rows: rows[0])).dt.to_period('1h')
# drop column
df = df.drop(columns='Period1')

How to remove a line from a numpy array/table if it is empty

I have numpy arrays which are around 2000 elements long each, but not every element has a value. Some are blank. As you can see at the end of the code, I've stacked them into one array called 'match'. How would I remove a row in match if it is missing an element? So, for example, if a particular ID is missing the magnitude, the entire row is removed. I'm only interested in keeping the rows that have data for all of the elements.
from astropy.table import Table
import numpy as np
data = '/home/myname/datable.fits'
data = Table.read(data, format="fits")
ID = np.array(data['ID']).astype(str)          # astype returns a new array, so assign the result
redshift = np.array(data['z']).astype(float)
radius = np.array(data['r']).astype(float)
mag = np.array(data['MAG']).astype(float)
match = np.stack((ID, redshift, radius, mag), axis=1)
Here you can use numpy.isnan, which gives True for missing values and False for existing values. However, numpy.isnan can only be applied to NumPy arrays with a native numeric dtype (such as np.float64).
Your requirement can be achieved as follows:
Note: here data stands for your stacked numpy array.
import numpy as np
data = np.array(some_array) # set data as your numpy array
key_col = np.array(data[:,0], dtype=np.float64) # If you want to filter based on column 0
filtered_data = data[~np.isnan(key_col)] # ~ is the logical not here
For better flexibility, consider using pandas!!
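For instance, a minimal pandas sketch, assuming the blank entries come through as NaN when the astropy Table from the question is converted with to_pandas():
df = data.to_pandas()                           # data is the astropy Table read earlier
df = df.dropna(subset=['ID', 'z', 'r', 'MAG'])  # keep only rows with all four fields present
match = df[['ID', 'z', 'r', 'MAG']].to_numpy()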
Hope this helps!!

Generate numpy array with duplicate rate

Here is my problem: I have to generate some synthetic data (7 or 8 columns), correlated with each other (using the Pearson coefficient). I can do this easily, but next I have to insert a percentage of duplicates in each column (yes, the Pearson coefficient will be lower), different for each column.
The problem is that I don't want to insert those duplicates by hand, because in my case it would be like cheating.
Does someone know how to generate correlated data that already contains duplicates? I've searched, but questions are usually about dropping or avoiding duplicates.
Language: python3
To generate correlated data I'm using this simple code: Generating correlated data
Try something like this :
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))
# NumPy arrays have no append method, so stack the selected rows back onto the array instead
array = np.concatenate([array, array[indices]], axis=0)
Here I make the assumption that your data is stored in array, which is an ndarray where each row contains your 7/8 columns of data.
The above code creates an array of random row indices, selects those rows, and appends them back onto the array.
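A hypothetical usage, assuming a (1000, 8) data array and a 10% duplicate rate:
import numpy as np
array = np.random.random(size=(1000, 8))
percentage = 0.10
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))
array = np.concatenate([array, array[indices]], axis=0)
print(array.shape)  # (1100, 8)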
I found the solution.
I'm posting the code; it might be helpful for someone.
# These are the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))
# This array represents a column of the covariance matrix (I want correlated data,
# so I randomly choose a number between 0.8 and 0.95)
# I added 7 other columns, with varying ranges of values (all above 0.7)
attr1 = np.random.uniform(0.8, .95, size=(8, 1))
# attr2 ... attr8 are generated like attr1
# corr_mat is the matrix, a union of the columns
corr_mat = np.column_stack((attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8))
from statsmodels.stats.correlation_tools import cov_nearest
# Using this function I found the nearest covariance matrix to my matrix,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)
from scipy.linalg import cholesky
upper_chol = cholesky(a)
# Finally, compute the inner product of rnd and upper_chol
ans = rnd @ upper_chol
# ans now holds randomly correlated data (high correlation, but it is customizable)
# Next I create a pandas DataFrame with the ans values
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])
# The last step is to truncate the float values of ans in a varying way, so I get
# duplicates in varying percentages
a = df.values
for i in range(8):
    trunc = np.random.randint(5, 12)
    print(trunc)
    a.T[i] = a.T[i].round(decimals=trunc)
# The float values of ans have 16 decimals, so I randomly choose an int
# between 5 and 12 and use it to truncate each value
Finally, these are my duplicate percentages for each column:
duplicate rate attribute: att1 = 5.159390000000002
duplicate rate attribute: att2 = 11.852260000000001
duplicate rate attribute: att3 = 12.036079999999998
duplicate rate attribute: att4 = 35.10611
duplicate rate attribute: att5 = 4.6471599999999995
duplicate rate attribute: att6 = 35.46553
duplicate rate attribute: att7 = 0.49115000000000464
duplicate rate attribute: att8 = 37.33252

Numpy: Finding correspondences in one array by uniques of another array, arbitrary length

I have a problem where I have two arrays, one with identifiers which can occur multiple times, let's just say
import numpy as np
ind = np.random.randint(0,10,(100,))
and another one which is of the same length and contains some info, in this case boolean, for each of the elements identified by ind. They are sorted correspondingly.
dat = np.random.randint(0,2,(100,)).astype(np.bool8)
I'm looking for a (faster?) way to do the following: take np.any() of dat over each group of elements defined by ind. The number of occurrences per element is, as in the example, random. What I'm doing now is
result = np.empty(len(np.unique(ind)), dtype=bool)
for i, uni in enumerate(np.unique(ind)):
    result[i] = np.any(dat[ind == uni])
Which is sort of slow. Any ideas?
Approach #1
Index ind with dat to select the ones required to be checked, get the binned counts with np.bincount, and see which bins have at least one occurrence -
result = np.bincount(ind[dat])>0
If ind has negative numbers, offset it with the min value -
ar = ind[dat]
result = np.bincount(ar-ar.min())>0
Approach #2
One more with np.unique -
unq = np.unique(ind[dat])
n = len(np.unique(ind))
result = np.zeros(n,dtype=bool)
result[unq] = 1
We can use pandas to get n :
import pandas as pd
n = pd.Series(ind).nunique()
Approach #3
One more with indexing -
ar = ind[dat]
result = np.zeros(ar.max()+1,dtype=bool)
result[ar] = 1
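A quick check on the example data, comparing Approach #1 against the original loop (minlength is added here only so the counts cover every identifier value):
ind = np.random.randint(0, 10, (100,))
dat = np.random.randint(0, 2, (100,)).astype(np.bool8)
loop = np.array([np.any(dat[ind == u]) for u in np.unique(ind)])
fast = np.bincount(ind[dat], minlength=ind.max() + 1) > 0
print(np.array_equal(loop, fast[np.unique(ind)]))  # True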
