I have a file with data similar to this:
[START]
Name = Peter
Sex = Male
Age = 34
Income[2020] = 40000
Income[2019] = 38500
[END]
[START]
Name = Maria
Sex = Female
Age = 28
Income[2020] = 43000
Income[2019] = 42500
Income[2018] = 40000
[END]
[START]
Name = Jane
Sex = Female
Age = 41
Income[2020] = 60500
Income[2019] = 57500
Income[2018] = 54000
[END]
I want to read this data into a pandas dataframe so that at the end it is similar to this:
Name Sex Age Income[2020] Income[2019] Income[2018]
Peter Male 34 40000 38500 NaN
Maria Female 28 43000 42500 40000
Jane Female 41 60500 57500 54000
So far, I wasn't able to figure out if this is a standard data file format (it has some similarities to JSON but is still very different).
Is there an elegant and fast way to read this data to a dataframe?
Elegant, I don't know, but easy, yes: Python is very good at parsing simply formatted text.
Here, [START] starts a new record, [END] ends it, and inside a record, you have key = value lines. You can easily build a custom parser to generate a list of records to feed into a pandas DataFrame:
import pandas as pd

inblock = False
fieldnames = []   # column order, in order of first appearance
data = []         # one dict per record

with open(filename) as f:
    for line in f:
        if inblock:
            if line.strip() == '[END]':
                inblock = False
            elif '=' in line:
                k, v = (i.strip() for i in line.split('=', 1))
                record[k] = v
                if k not in fieldnames:
                    fieldnames.append(k)
        else:
            if line.strip() == '[START]':
                inblock = True
                record = {}
                data.append(record)

df = pd.DataFrame(data, columns=fieldnames)
df is as expected:
Name Sex Age Income[2020] Income[2019] Income[2018]
0 Peter Male 34 40000 38500 NaN
1 Maria Female 28 43000 42500 40000
2 Jane Female 41 60500 57500 54000
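One caveat: the parser stores every value as a string. If you want numeric dtypes for Age and the Income columns, you can convert them afterwards (a small follow-up sketch, assuming the column names shown above):

numeric_cols = [c for c in df.columns if c not in ('Name', 'Sex')]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)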
I have two dataframes where I need to update the first one based on values from the second one, if they exist. The sample below replaces a student_Id with its new_id whenever it appears in the old_id column of the second dataframe.
import pandas as pd
import numpy as np

student = {
    'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
    'gender': ['male', 'male', 'male', 'female', 'female', 'male'],
    'math score': [50, 100, 70, 80, 75, 40],
    'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
}
updatedId = {
    'old_id': ['ok83', '234v'],
    'new_id': ['83ko', 'v432'],
}
df_student = pd.DataFrame(student)
df_updated_id = pd.DataFrame(updatedId)
print(df_student)
print(df_updated_id)

# Method with np.where
for index, row in df_updated_id.iterrows():
    df_student['student_Id'] = np.where(df_student['student_Id'] == row['old_id'], row['new_id'], df_student['student_Id'])
# print(df_student)

# Method with dataframe.mask
for index, row in df_updated_id.iterrows():
    df_student['student_Id'].mask(df_student['student_Id'] == row['old_id'], row['new_id'], inplace=True)
print(df_student)
Both methods work and yield the correct result:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 ok83
5 ganesh male 40 234v
old_id new_id
0 ok83 83ko
1 234v v432
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
Nonetheless, the actual student data has about 500,000 rows and updated_id has about 6,000 rows, so I run into performance issues; the loop is very slow.
A simple timer was placed to observe the time taken for increasing numbers of rows in df_updated_id:
100 rows - numpy Time=3.9020769596099854; mask Time=3.9169061183929443
500 rows - numpy Time=20.42293930053711; mask Time=19.768696784973145
1000 rows - numpy Time=40.06309795379639; mask Time=37.26559829711914
My question is whether I can optimize this using a merge (join), or by ditching iterrows entirely. I tried something like the approaches in Replace dataframe column values based on matching id in another dataframe and How to iterate over rows in a DataFrame in Pandas, but failed to get them to work.
Please advise.
You can also try with map:
df_student['student_Id'] = (
    df_student['student_Id'].map(df_updated_id.set_index('old_id')['new_id'])
    .fillna(df_student['student_Id'])
)
print(df_student)
# Output
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
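This replaces the per-row Python loop with a single hash-based lookup, so it scales to the sizes in the question. A rough timing sketch on hypothetical synthetic data of roughly that size (numbers will vary by machine):

import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ids = np.arange(500_000).astype(str)                # ~500,000 students
df_student = pd.DataFrame({'student_Id': rng.permutation(ids)})
old = rng.choice(ids, size=6_000, replace=False)    # ~6,000 id updates
df_updated_id = pd.DataFrame({'old_id': old, 'new_id': [o + '_new' for o in old]})

start = time.perf_counter()
mapping = df_updated_id.set_index('old_id')['new_id']
df_student['student_Id'] = (
    df_student['student_Id'].map(mapping).fillna(df_student['student_Id'])
)
print(f"map + fillna: {time.perf_counter() - start:.3f}s")  # typically a fraction of a second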
Update
I believe the updated_id isn't unique, so I need to further pre-process the data.
In this case, you could drop duplicates first, assuming the last value (keep='last') is the most recent one for a given old_id:
sr = df_updated_id.drop_duplicates('old_id', keep='last') \
                  .set_index('old_id')['new_id']
df_student['student_Id'] = df_student['student_Id'].map(sr) \
                                                   .fillna(df_student['student_Id'])
Note: this is exactly what #BENY's answer does. Since it builds a dict-like mapping, only the last occurrence of each old_id is kept. However, if you want to keep the first value that appears, that code doesn't work; with drop_duplicates you can adjust the keep parameter.
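A tiny illustration of the difference, on a hypothetical duplicated mapping:

dups = pd.DataFrame({'old_id': ['ok83', 'ok83'],
                     'new_id': ['83ko', 'ko38']})
print(dups.drop_duplicates('old_id', keep='last'))   # keeps ok83 -> ko38
print(dups.drop_duplicates('old_id', keep='first'))  # keeps ok83 -> 83ko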
We can just use replace:
df_student.replace({'student_Id':df_updated_id.set_index('old_id')['new_id']},inplace=True)
df_student
Out[337]:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
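If you prefer an explicit dict and want to avoid inplace, an equivalent sketch:

mapping = dict(zip(df_updated_id['old_id'], df_updated_id['new_id']))
df_student['student_Id'] = df_student['student_Id'].replace(mapping)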
Hello, this is my CSV data:
Age Name
0 22 George
1 33 lucas
2 22 Nick
3 12 Leo
4 32 Adriano
5 53 Bram
6 11 David
7 32 Andrei
8 22 Sergio
I want to use an if/else statement; for example, if George is an adult, create a new row and insert +.
I mean:
Age Name Adul
22 George +
What is the best way?
This is the code I am using to read the data from the CSV:
import pandas as pd

produtos = pd.read_csv('User.csv', nrows=9)
print(produtos)

for i, produto in produtos.iterrows():
    print(i, produto['Age'], produto['Name'])
IIUC, you want to create a new column (not row) called "Adul". You can do this with numpy.where:
import numpy as np

# Use None rather than np.nan here: np.where coerces np.nan to the literal
# string 'nan' when the other branch is a string.
produtos["Adul"] = np.where(produtos["Age"].ge(18), "+", None)
Edit:
To only do this for a specific name, you could use:
name = input("Name")
if name in produtos["Name"].tolist():
    # .any() collapses the boolean Series to a single truth value
    if (produtos.loc[produtos["Name"] == name, "Age"] >= 18).any():
        produtos.loc[produtos["Name"] == name, "Adul"] = "+"
You can do this:
import numpy as np

# As above, None avoids np.where turning np.nan into the string 'nan'
produtos["Adul"] = np.where(produtos["Age"] >= 18, "+", None)
I need to create a new column with data extracted from other columns.
Name Surname Age
Nivea Jones 45
Kelly Pams 68
Matthew Currigan 24
...
I would like to create a new column with only the first letter from the name and surname, i.e.
Name Surname Age Short FN
Nivea Jones 45 NJ
Kelly Pams 68 KP
Matthew Currigan 24 MC
...
I did as follows:
df['Short FN'] = df['Name'].str.get(0) + df['Surname'].str.get(0)
and it works well. However, I need to build a function with two columns (in this case, name and surname) as parameters:
def sh(x, y):
    df['Short FN'] = df[x].str.get(0) + df[y].str.get(0)
    return
and it does not work, probably because I am passing dataframe column names as parameters. Also, I do not know whether I should return something, and what.
Could you please explain how to create a function that takes columns as parameters, and how to use it (it is not clear to me whether I need to iterate over rows with a for loop)?
You can do this:
def sh(x, y):
    return x[0] + y[0]

df['Short'] = df.apply(lambda x: sh(x['Name'], x['Surname']), axis=1)
print(df)
Name Surname Age Short
0 Nivea Jones 45 NJ
1 Kelly Pams 68 KP
2 Matthew Currigan 24 MC
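Note that apply with axis=1 calls the function once per row in Python, so on large frames the vectorized form from the question will be noticeably faster:

df['Short'] = df['Name'].str[0] + df['Surname'].str[0]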
There are several ways to do that. The simplest way, assuming df is global (as it seems to be in your case), is:
def short_name(col1, col2):
    return df[col1].str[0] + df[col2].str[0]
Calling short_name("Name", "Surname") produces:
0 NJ
1 KP
2 MC
dtype: object
You can now use it in whatever way you want. For example:
df["sn"] = short_name("Name", "Surname")
print(df)
# produces:
Name Surname Age sn
0 Nivea Jones 45 NJ
1 Kelly Pams 68 KP
2 Matthew Currigan 24 MC
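If you would rather not rely on a global df, a variant that takes the dataframe as an explicit parameter (a small sketch):

def short_name(frame, col1, col2):
    return frame[col1].str[0] + frame[col2].str[0]

df['Short FN'] = short_name(df, 'Name', 'Surname')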
I have a sample schema which consists of 12 columns, and each column has a certain set of categories. Now I need to simulate that data in a dataframe of around 1000 rows. How do I go about it?
I have used the code below to generate data for each column:
import random

Location = ['USA','India','Prague','Berlin','Dubai','Indonesia','Vienna']
Location = random.choice(Location)
Age = ['Under 18','Between 18 and 64','65 and older']
Age = random.choice(Age)
Gender = ['Female','Male','Other']
Gender = random.choice(Gender)
and so on
I need the output as below:
Location Age Gender
Dubai below 18 Female
India 65 and older Male
...
You can create each column one by one using np.random.choice:
import numpy as np
import pandas as pd

df = pd.DataFrame()
N = 1000
df["Location"] = np.random.choice(Location, size=N)
df["Age"] = np.random.choice(Age, size=N)
df["Gender"] = np.random.choice(Gender, size=N)
Or do that using a list comprehension:
column_to_choice = {"Location": Location, "Age": Age, "Gender": Gender}
df = pd.DataFrame(
    [np.random.choice(column_to_choice[c], size=N) for c in column_to_choice]
).T
df.columns = list(column_to_choice.keys())
Result:
>>> print(df.head())
Location Age Gender
0 India 65 and older Female
1 Berlin Between 18 and 64 Female
2 USA Between 18 and 64 Male
3 Indonesia Under 18 Male
4 Dubai Under 18 Other
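If you need reproducible samples, you can seed a generator first (a sketch using numpy's Generator API; the np.random.choice calls above behave the same, just unseeded):

rng = np.random.default_rng(42)  # fixed seed so every run gives the same data
df = pd.DataFrame({
    "Location": rng.choice(Location, size=N),
    "Age": rng.choice(Age, size=N),
    "Gender": rng.choice(Gender, size=N),
})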
You can write a for loop for the number of rows you want in your dataframe and build a list of dictionaries, then use that list to construct the dataframe.
In [15]: list2 = []

In [16]: for i in range(5):
    ...:     loc = random.choice(Location)
    ...:     age = random.choice(Age)
    ...:     gen = random.choice(Gender)
    ...:     k = {'Location': loc, 'Age': age, 'Gender': gen}
    ...:     list2.append(k)
    ...:

In [17]: import pandas as pd

In [18]: df = pd.DataFrame(list2)
In [19]: df
Out[19]:
Age Gender Location
0 Between 18 and 64 Other Berlin
1 65 and older Other USA
2 65 and older Male Dubai
3 Between 18 and 64 Male Dubai
4 Between 18 and 64 Male Indonesia
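The same idea scales to the full 1000 rows with a list comprehension, building the whole list first and constructing the DataFrame once (much faster than appending to a DataFrame row by row). A sketch, assuming the Location, Age, and Gender lists from the question:

import random
import pandas as pd

N = 1000
records = [
    {'Location': random.choice(Location),
     'Age': random.choice(Age),
     'Gender': random.choice(Gender)}
    for _ in range(N)
]
df = pd.DataFrame(records)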