How to create a dataframe with simulated data in python

How to create a dataframe with simulated data in python - python

I have sample schema, which consists 12 columns, and each column has certain category. Now i need to simulate those data into a dataframe of around 1000 rows. How do i go about it?
I have used below code to generate data for each column
Location = ['USA','India','Prague','Berlin','Dubai','Indonesia','Vienna']
Location = random.choice(Location)
Age = ['Under 18','Between 18 and 64','65 and older']
Age = random.choice(Age)
Gender = ['Female','Male','Other']
Gender = random.choice(Gender)
and so on
I need the output as below
Location Age Gender
Dubai below 18 Female
India 65 and older Male
.
.
.
.

You can create each column one by one using np.random.choice:
df = pd.DataFrame()
N = 1000
df["Location"] = np.random.choice(Location, size=N)
df["Age"] = np.random.choice(Age, size=N)
df["Gender"] = np.random.choice(Gender, size=N)
Or do that using a list comprehension:
column_to_choice = {"Location": Location, "Age": Age, "Gender": Gender}
df = pd.DataFrame(
[np.random.choice(column_to_choice[c], 100) for c in column_to_choice]
).T
df.columns = list(column_to_choice.keys())
Result:
>>> print(df.head())
Location Age Gender
0 India 65 and older Female
1 Berlin Between 18 and 64 Female
2 USA Between 18 and 64 Male
3 Indonesia Under 18 Male
4 Dubai Under 18 Other

You can create a for loop for the number of rows you want in your dataframe and then generate a list of dictionary. Use the list of dictionary to generate the dataframe.
In [16]: for i in range(5):
...: k={}
...: loc = random.choice(Location)
...: age = random.choice(Age)
...: gen = random.choice(Gender)
...: k = {'Location':loc,'Age':age, 'Gender':gen}
...: list2.append(k)
...:
In [17]: import pandas as pd
In [18]: df = pd.DataFrame(list2)
In [19]: df
Out[19]:
Age Gender Location
0 Between 18 and 64 Other Berlin
1 65 and older Other USA
2 65 and older Male Dubai
3 Between 18 and 64 Male Dubai
4 Between 18 and 64 Male Indonesia

Related

Pandas: index-derived column with specific increments based on other columns

I have the following data frame:
import pandas as pd
pandas_df = pd.DataFrame([
["SEX", "Male"],
["SEX", "Female"],
["EXACT_AGE", None],
["Country", "Afghanistan"],
["Country", "Albania"]],
columns=['FullName', 'ResponseLabel'
])
Now what I need to do is to add sort order to this dataframe. Each new "FullName" would increment it by 100 and each consecutive "ResponseLabel" for a given "FullName" would increment it by 1 (for this specific "FullName"). So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
Result:
FullName ResponseLabel SortOrderResponseLabel index SortOrderFullName SortOrder
0 SEX Male 1 0 100 101
1 SEX Female 2 0 100 102
2 EXACT_AGE NULL 1 2 200 201
3 Country Afghanistan 1 3 300 301
4 Country Albania 2 3 300 302
The result that I get on my "SortOrder" column is correct but I wonder if there is some better approach pandas-wise?
Thank you!

The best way to do this would be to use ngroup and cumcount
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
name_group.ngroup(ascending=False).add(1).mul(100) +
name_group.cumcount().add(1)
)
Output
FullName ResponseLabel sort_order
0 SEX Male 101
1 SEX Female 102
2 EXACT_AGE None 201
3 Country Afghanistan 301
4 Country Albania 302

Matching lists to dataframes

I have a dataframe of people with Age as a column. I would like to match this age to a group, i.e. Baby=0-2 years old, Child=3-12 years old, Young=13-18 years old, Young Adult=19-30 years old, Adult=31-50 years old, Senior Adult=51-65 years old.
I created the lists that define these year groups, e.g. Adult=list(range(31,51)) etc.
How do I match the name of the list 'Adult' to the dataframe by creating a new column?
Small input: the dataframe is made up of three columns: df['Name'], df['Country'], df['Age'].
Name Country Age
Anthony France 15
Albert Belgium 54
.
.
.
Zahra Tunisia 14
So I need to match the age column with lists that I already have. The output should look like:
Name Country Age Group
Anthony France 15 Young
Albert Belgium 54 Adult
.
.
.
Zahra Tunisia 14 Young
Thanks!

IIUC I would go with np.select:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Age': [3, 20, 40]})
condlist = [df.Age.between(0,2),
df.Age.between(3,12),
df.Age.between(13,18),
df.Age.between(19,30),
df.Age.between(31,50),
df.Age.between(51,65)]
choicelist = ['Baby', 'Child', 'Young',
'Young Adult', 'Adult', 'Senior Adult']
df['Adult'] = np.select(condlist, choicelist)
Output:
Age Adult
0 3 Child
1 20 Young Adult
2 40 Adult

Here's a way to do that using pd.cut:
df = pd.DataFrame({"person_id": range(25), "age": np.random.randint(0, 100, 25)})
print(df.head(10))
==>
person_id age
0 0 30
1 1 42
2 2 78
3 3 2
4 4 44
5 5 43
6 6 92
7 7 3
8 8 13
9 9 76
df["group"] = pd.cut(df.age, [0, 18, 50, 100], labels=["child", "adult", "senior"])
print(df.head(10))
==>
person_id age group
0 0 30 adult
1 1 42 adult
2 2 78 senior
3 3 2 child
4 4 44 adult
5 5 43 adult
6 6 92 senior
7 7 3 child
8 8 13 child
9 9 76 senior
Per your question, if you have a few lists (like the ones below), and would like to convert use them for 'binning', you can do:
# for example, these are the lists
Adult = list(range(18,50))
Child = list(range(0, 18))
Senior = list(range(50, 100))
# Creating bins out of the lists.
bins = [min(l) for l in [Child, Adult, Senior]]
bins.append(max([max(l) for l in [Child, Adult, Senior]]))
labels = ["Child", "Adult", "Senior"]
# using the bins:
df["group"] = pd.cut(df.age, bins, labels=labels)

To make things more clear for beginners, you can define a function that will return the age group of each person accordingly, then use pandas.apply() to apply that function to our 'Group' column:
import pandas as pd
def age(row):
a = row['Age']
if 0 < a <= 2:
return 'Baby'
elif 2 < a <= 12:
return 'Child'
elif 12 < a <= 18:
return 'Young'
elif 18 < a <= 30:
return 'Young Adult'
elif 30 < a <= 50:
return 'Adult'
elif 50 < a <= 65:
return 'Senior Adult'
df = pd.DataFrame({'Name':['Anthony','Albert','Zahra'],
'Country':['France','Belgium','Tunisia'],
'Age':[15,54,14]})
df['Group'] = df.apply(age, axis=1)
print(df)
Output:
Name Country Age Group
0 Anthony France 15 Young
1 Albert Belgium 54 Senior Adult
2 Zahra Tunisia 14 Young

Read structured file in python

I have a file with data similar to this:
[START]
Name = Peter
Sex = Male
Age = 34
Income[2020] = 40000
Income[2019] = 38500
[END]
[START]
Name = Maria
Sex = Female
Age = 28
Income[2020] = 43000
Income[2019] = 42500
Income[2018] = 40000
[END]
[START]
Name = Jane
Sex = Female
Age = 41
Income[2020] = 60500
Income[2019] = 57500
Income[2018] = 54000
[END]
I want to read this data into a pandas dataframe so that at the end it is similar to this
Name Sex Age Income[2020] Income[2019] Income[2018]
Peter Male 34 40000 38500 NaN
Maria Female 28 43000 42500 40000
Jane Female 41 60500 57500 54000
So far, I wasn't able to figure out if this is a standard data file format (it has some similarities to JSON but is still very different).
Is there an elegant and fast way to read this data to a dataframe?

Elegant I do not know, but easy way, yes. Python is very good at parsing simple formatted text.
Here, [START] starts a new record, [END] ends it, and inside a record, you have key = value lines. You can easily build a custom parser to generate a list of records to feed into a pandas DataFrame:
inblock = False
fieldnames = []
data = []
for line in open(filename):
if inblock:
if line.strip() == '[END]':
inblock = False
elif '=' in line:
k, v = (i.strip() for i in line.split('=', 1))
record[k] = v
if not k in fieldnames:
fieldnames.append(k)
else:
if line.strip() == '[START]':
inblock = True
record = {}
data.append(record)
df = pd.DataFrame(data, columns=fieldnames)
df is as expected:
Name Sex Age Income[2020] Income[2019] Income[2018]
0 Peter Male 34 40000 38500 NaN
1 Maria Female 28 43000 42500 40000
2 Jane Female 41 60500 57500 54000

How to get the rolling mean and find the percentage of Male and Female in each occupation?

occupation gender number
administrator F 36
M 43
artist F 13
M 15
doctor M 7
educator F 26
M 69
How to get the rolling mean of first 2 column and find the average of (M)male and (F)female in each occupation
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
sep='|', index_col='user_id')
users.head()
age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213

# create a data frame and apply count to gender
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})
# create a DataFrame and apply count for each occupation
occup_count = users.groupby(['occupation']).agg('count')
# divide the gender_ocup per the occup_count and multiply per 100
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100
# present all rows from the 'gender column'
occup_gender.loc[: , 'gender']
courtesy
https://github.com/guipsamora/pandas_exercises/blob/master/03_Grouping/Occupation/Exercises_with_solutions.ipynb

Specifying column order following groupby aggregation

The ordering of my age, height and weight columns is changing with each run of the code. I need to keep the order of my agg columns static because I ultimately refer to this output file according to the column locations. What can I do to make sure age, height and weight are output in the same order every time?
d = pd.read_csv(input_file, na_values=[''])
df = pd.DataFrame(d)
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col).agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})
df_out.to_csv(output_file, sep=',')

I think you can use subset:
df_out = df.groupby(df.index_col)
.agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})[['age','height','weight']]
Also you can use pandas functions:
df_out = df.groupby(df.index_col)
.agg({'age':'mean', 'height':sum, 'weight':sum})[['age','height','weight']]
Sample:
df = pd.DataFrame({'name':['q','q','a','a'],
'address':['a','a','s','s'],
'age':[7,8,9,10],
'height':[1,3,5,7],
'weight':[5,3,6,8]})
print (df)
address age height name weight
0 a 7 1 q 5
1 a 8 3 q 3
2 s 9 5 a 6
3 s 10 7 a 8
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col)
.agg({'age':'mean', 'height':sum, 'weight':sum})[['age','height','weight']]
print (df_out)
age height weight
name address
a s 9.5 12 14
q a 7.5 4 8
EDIT by suggestion - add reset_index, here as_index=False does not work if need index values too:
df_out = df.groupby(df.index_col)
.agg({'age':'mean', 'height':sum, 'weight':sum})[['age','height','weight']]
.reset_index()
print (df_out)
name address age height weight
0 a s 9.5 12 14
1 q a 7.5 4 8

If you care mostly about the order when written to a file and not while its still in a DataFrame object, you can set the columns parameter of the to_csv() method:
>>> df = pd.DataFrame(
{'age': [28,63,28,45],
'height': [183,156,170,201],
'weight': [70.2, 62.5, 65.9, 81.0],
'name': ['Kim', 'Pat', 'Yuu', 'Sacha']},
columns=['name','age','weight', 'height'])
>>> df
name age weight height
0 Kim 28 70.2 183
1 Pat 63 62.5 156
2 Yuu 28 65.9 170
3 Sacha 45 81.0 201
>>> df_out = df.groupby(['age'], as_index=False).agg(
{'weight': sum, 'height': sum})
>>> df_out
age height weight
0 28 353 136.1
1 45 201 81.0
2 63 156 62.5
>>> df_out.to_csv('out.csv', sep=',', columns=['age','height','weight'])
out.csv then looks like this:
,age,height,weight
0,28,353,136.10000000000002
1,45,201,81.0
2,63,156,62.5

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to create a dataframe with simulated data in python - python

Related

Pandas: index-derived column with specific increments based on other columns

Matching lists to dataframes

Read structured file in python

How to get the rolling mean and find the percentage of Male and Female in each occupation?

Specifying column order following groupby aggregation

Categories

Resources