This question already has answers here: Split / Explode a column of dictionaries into separate columns with pandas (13 answers). Closed 14 days ago.
I have a DataFrame with one string column, and I'd like to split it into multiple columns, separating on ','. I want to name each new column after the string that appears before ':' in that field.
The column looks like this:
0 {"ID":"AP001","Name":"Anderson","Age":"23"}
1 {"ID":"AP002","Name":"Jasmine","Age":"36"}
2 {"ID":"AP003","Name":"Zack","Age":"28"}
3 {"ID":"AP004","Name":"Chole","Age":"39"}
And I want to split it into this:
ID     Name      Age
AP001  Anderson  23
AP002  Jasmine   36
AP003  Zack      28
AP004  Chole     39
I have tried to split it by ',', but I'm not sure how to remove the string before ':' and use it as the column name.
data1 = data['demographic'].str.split(',', expand=True)
This is what I get after splitting it:
0             1                  2
"ID":"AP001"  "Name":"Anderson"  "Age":"23"
"ID":"AP002"  "Name":"Jasmine"   "Age":"36"
"ID":"AP003"  "Name":"Zack"      "Age":"28"
"ID":"AP004"  "Name":"Chole"     "Age":"39"
Does anyone know how to do it?
You can use ast.literal_eval:
import ast
import pandas as pd

# Parse each string into a dict, then flatten the dicts into columns
data1 = pd.json_normalize(data['demographic'].apply(ast.literal_eval))
print(data1)
# Output
ID Name Age
0 AP001 Anderson 23
1 AP002 Jasmine 36
2 AP003 Zack 28
3 AP004 Chole 39
This question already has answers here: How can I pivot a dataframe? (5 answers). Closed 8 months ago.
I have data that contains several rows for each employee. Each row contains one attribute and its value. For example:
Worker ID  Last Name  First Name  Metric Name  Metric Value
1          Hanson     Scott       Attendance   98
1          Hanson     Scott       On time      35
2          Avery      Kara        Attendance   95
2          Avery      Kara        On time      57
I would like to combine the rows by worker ID, giving each metric its own column, like so:
Worker ID  Last Name  First Name  Attendance  On time
1          Hanson     Scott       98          35
2          Avery      Kara        95          57
I can do worker_data.pivot_table(values='Metric Value', index='Worker ID', columns=['Metric Name']), but that does not give me the first and last names as columns. What is the best Pandas way to merge these rows?
In your solution, change the index parameter to a list, and to avoid a MultiIndex, remove the [] around the columns parameter:
df = (worker_data.pivot_table(index=['Worker ID', 'Last Name', 'First Name'],
                              columns='Metric Name',
                              values='Metric Value')
                 .reset_index()
                 .rename_axis(None, axis=1))
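For reference, a self-contained sketch using the sample rows above (the frame name worker_data comes from the question; building it by hand is my addition):
import pandas as pd

worker_data = pd.DataFrame({
    'Worker ID': [1, 1, 2, 2],
    'Last Name': ['Hanson', 'Hanson', 'Avery', 'Avery'],
    'First Name': ['Scott', 'Scott', 'Kara', 'Kara'],
    'Metric Name': ['Attendance', 'On time', 'Attendance', 'On time'],
    'Metric Value': [98, 35, 95, 57],
})

# Pivot the metrics into columns, then flatten the index back out
df = (worker_data.pivot_table(index=['Worker ID', 'Last Name', 'First Name'],
                              columns='Metric Name',
                              values='Metric Value')
                 .reset_index()
                 .rename_axis(None, axis=1))
print(df)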
This question already has answers here: How can I pivot a dataframe? (5 answers). Closed 1 year ago.
I have a df with these values:
name marks subject
mark 50 math
mark 75 french
tom 25 english
tom 30 Art
luca 100 math
luca 100 art
How can I transpose the dataframe so it looks like this:
name  math  art  french  english
mark    50           75
tom           30               25
luca   100  100
I tried df.T and df[['marks','subject']].T, but neither gives the result I want.
This is a pivot. First we need to normalize the subject column, then we pivot.
# Lowercase first so 'Art' and 'art' land in the same column
df['subject'] = df['subject'].str.lower()
df.pivot(index='name', columns='subject', values='marks')
See here for more info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot
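For completeness, a runnable sketch with the sample data; calling reset_index at the end is my addition, to turn name back into a regular column:
import pandas as pd

df = pd.DataFrame({'name': ['mark', 'mark', 'tom', 'tom', 'luca', 'luca'],
                   'marks': [50, 75, 25, 30, 100, 100],
                   'subject': ['math', 'french', 'english', 'Art', 'math', 'art']})

# Normalize casing, then pivot subjects into columns
df['subject'] = df['subject'].str.lower()
out = df.pivot(index='name', columns='subject', values='marks')
print(out.reset_index())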
This question already has answers here: Customizing the separator in pandas read_csv (4 answers). Closed 2 years ago.
I have a .txt file that is separated as follows for multiple rows:
Vermont;VT;Tunbridge;95000204;Republican;John Kasich;36;0.319
When read with pandas I only get 1 column.
How do I split the data in Python so that each separated value is a different column in a pandas dataframe?
Thanks
Like this (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html for more):
import pandas as pd

df = pd.read_csv('data.csv', sep=';')
print(df)
where data.csv is
Vermont;VT;Tunbridge;95000204;Republican;John Kasich;36;0.319
NewYork;VT;Tunbridge;95000204;Republican;John Kasich;36;0.88
output
Vermont VT Tunbridge ... John Kasich 36 0.319
0 NewYork VT Tunbridge ... John Kasich 36 0.88
[1 rows x 8 columns]
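One caveat worth flagging (my observation, not part of the original answer): since the file has no header row, the defaults promote the first line to column names, which is why the Vermont row shows up as a header above. Passing header=None keeps it as data:
import pandas as pd

# header=None tells pandas the file has no header row, so the first
# line stays as data instead of becoming the column names
df = pd.read_csv('data.csv', sep=';', header=None)
print(df)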
This question already has answers here: Filter pandas DataFrame by substring criteria (17 answers). Closed 2 years ago.
I am also open to any other algorithm or method you would use to detect anomalies in a single column. Filtering by a column is not showing the data.
I am using the following approach to limit my dataframe only to two columns
X = pd.read_csv('C:/Users/Path/file.csv', usecols=["Describe_File", "numbers"])
Describe_File numbers
0 This is the start 25
1 Ending is coming 42
2 Middle of the story 525
3 This is the start 65
4 This is the start 25
5 Middle of the story 35
6 This is the start 28
7 This is the start 24
8 Ending is coming 24
9 Ending is coming 35
10 Ending is coming 25
11 Ending is coming 24
12 This is the start 215
Now I want to go to the column Describe_File, filter by the string "This is the start", and then show the values of numbers.
To do so I usually use the following code, but for some reason it is not giving me anything. The string exists in my csv file.
X = X[X.Describe_File == "This is the start"]
You can use .str.contains(), a vectorized substring search, i.e.
df = X[X.Describe_File.str.contains("This is the start", regex=False)]
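If you'd rather keep the exact == comparison, one common culprit (a guess, since the file itself isn't shown) is stray whitespace around the values; stripping it first lets == match, and .loc then pulls out the numbers column:
# Strip surrounding whitespace, then select the matching rows' numbers
X['Describe_File'] = X['Describe_File'].str.strip()
numbers = X.loc[X['Describe_File'] == "This is the start", 'numbers']
print(numbers)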
main_df:
Name Age Id DOB
0 Tom 20 A4565 22-07-1993
1 nick 21 G4562 11-09-1996
2 krish AKL F4561 15-03-1997
3 636A 18 L5624 06-07-1995
4 mak 20 K5465 03-09-1997
5 nits 55 56541 45aBc
6 444 66 NIT 09031992
column_info_df:
Column_Name Column_Type
0 Name string
1 Age integer
2 Id string
3 DOB Date
How can I find data-type error values in main_df? For example, from column_info_df we can see that 'Name' is a string column, so in main_df the 'Name' column should contain only string or alphanumeric values; anything else is an error. I need to collect those data-type error values in a separate df.
error output df:
  Column_Name Current_Value Exp_Dtype  Index_No.
0        Name           444    string          6
1         Age           AKL   integer          2
2          Id         56541    string          5
3         DOB         45aBc      Date          5
4         DOB      09031992      Date          6
I tried this:
for i, r in column_info_df.iterrows():
    if r['Column_Type'] == 'string':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^a-z|A-Z]+')]
    elif r['Column_Type'] == 'integer':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^0-9]+')]
    elif r['Column_Type'] == 'Date':
I'm stuck here, because this regex is not catching every error. I don't know how to go further.
Here is one way of using df.eval(). Note: this checks values against a pattern and returns the non-matching ones; it cannot validate actual types. For example, if the date column has an entry that looks like a date but is an invalid date, this would not identify it:
d={"string":".str.contains(r'[a-z|A-Z]')","integer":".str.contains('^[0-9]*$')",
"Date":".str.contains('\d\d-\d\d-\d\d\d\d')"}
m=df.eval([f"~{a}{b}"
for a,b in zip(column_info_df['Column_Name'],column_info_df['Column_Type'].map(d))]).T
final=(pd.DataFrame(np.where(m,df,np.nan),columns=df.columns)
.reset_index().melt('index',var_name='Column_Name',
value_name='Current_Value').dropna())
final['Expected_dtype']=(final['Column_Name']
.map(column_info_df.set_index('Column_Name')['Column_Type']))
print(final)
Output:
index Column_Name Current_Value Expected_dtype
6 6 Name 444 string
9 2 Age AKL integer
19 5 Id 56541 string
26 5 DOB 45aBc Date
27 6 DOB 09031992 Date
I agree there can be better regex patterns for this job, but the idea should be the same.
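If you also need real date validation, which the pattern above deliberately does not attempt, pd.to_datetime with errors='coerce' flags unparseable entries. A small sketch, assuming the DD-MM-YYYY format from the sample data:
import pandas as pd

dob = main_df['DOB']
# to_datetime returns NaT for values that do not parse as real dates
bad_dates = dob[pd.to_datetime(dob, format='%d-%m-%Y', errors='coerce').isna()]
print(bad_dates)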
If I understood what you did, you created a separate dataframe which contains info about your main one.
What I suggest instead is to use the built-in methods offered by pandas to deal with dataframes.
For instance, if you have a dataframe main, then:
main.info()
will give you the type of object for each column. Note that a column can contain only one dtype, as it is a Series, which is itself backed by an ndarray.
So your Name column cannot contain anything but strings that you might have missed. Instead, you can have NaN values. You can check for them with the help of
main.describe()
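For a more direct missing-value check (my addition, a common companion to describe):
# Count NaN values per column (frame name `main` as in the answer above)
print(main.isna().sum())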
I hope that helped :-)