I have a csv file like this:
I want to transform it into a pandas dataframe like this:
Basically I'm trying to normalize the dataset to populate a SQL table.
I have used json_normalize to create a separate dataset from the genres column, but I'm at a loss as to how to transform both columns as shown in the depiction above.
Some suggestions would be highly appreciated.
If the genre_id is the only numeric value (as shown in the picture), you can use the following:
# find all runs of digits in the column and join them into a comma-separated string
df['genre_id'] = df['genres'].str.findall(r'(\d+)').apply(', '.join)
# use pandas.DataFrame.explode to generate one row per comma-separated genre_id
df = df.assign(genre_id=df.genre_id.str.split(',')).explode('genre_id')
# finally, remove the extra leading space left by the join
df['genre_id'] = df['genre_id'].str.lstrip()
#if required create a new dataframe with these 2 columns only
df = df[['id','genre_id']]
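For instance, with a toy genres column (the sample values below are assumptions modeled on the question), the full pipeline looks like:

```python
import pandas as pd

# Toy data: each row holds a movie id and a stringified list of genre dicts
df = pd.DataFrame({
    'id': [1, 2],
    'genres': ["[{'id': 28}, {'id': 12}]", "[{'id': 16}]"],
})

# Extract every run of digits and join them into one comma-separated string
df['genre_id'] = df['genres'].str.findall(r'(\d+)').apply(', '.join)

# Split the string back into a list and explode to one genre_id per row
df = df.assign(genre_id=df['genre_id'].str.split(',')).explode('genre_id')

# Strip the leading space left by the ', ' join
df['genre_id'] = df['genre_id'].str.lstrip()

df = df[['id', 'genre_id']]
print(df)
```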
How would I remove rows in a dataset which do not match the dictionary values of two columns? For example, take the snippet of my data set.
I want to remove the rows where the data doesn't match the dictionary (like row 6) and apply this to a large data set.
dictionary (SiteID -> Location): {188: 'A', 203: 'B', 206: 'C', 270: 'D', 463: 'E'}
Try doing it with merge:
d = {188: 'A', 203: 'B', 206: 'C', 270: 'D', 463: 'E'}
lookup = pd.Series(d).rename_axis('SiteID').to_frame('Location').reset_index()
# an inner merge on the shared SiteID/Location columns drops the non-matching rows
out = df.merge(lookup, how='inner')
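A minimal runnable sketch of the merge approach, with toy data modeled on the question (the mismatching row is an assumption; an inner merge is used so non-matching rows are actually dropped):

```python
import pandas as pd

# Toy frame: the row with SiteID 206 / Location 'X' does not match the
# dictionary and should be removed
df = pd.DataFrame({
    'SiteID': [188, 203, 206],
    'Location': ['A', 'B', 'X'],
})

d = {188: 'A', 203: 'B', 206: 'C', 270: 'D', 463: 'E'}
lookup = pd.Series(d).rename_axis('SiteID').to_frame('Location').reset_index()

# An inner merge on both shared columns keeps only rows whose
# (SiteID, Location) pair appears in the dictionary
out = df.merge(lookup, how='inner')
print(out)
```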
I have a dataframe InputDF shown below.
It has only one column, Col1 (the 0th column), whose values are all strings:
I am trying to use the values in column 0 as a formula in pandas (mentioned below)
I have another empty data frame DF2 where I try the following to insert data:
DF2["column1"] = InputDF.loc[0, 0]
...but this just stores the literal string, i.e. DF2["column1"] = 'p_input[Order No]'
I need this to be DF2["column1"] = p_input[Order No]
so that I can save the data from the "Order No" column of the p_input dataframe into the "column1" column of the DF2 dataframe
[Note: p_input is another dataframe, due to some issues these assumptions can not change]
eval is your friend: DF2["column1"] = eval(InputDF.loc[0, 0]). Note that the stored string must be valid Python, e.g. 'p_input["Order No"]' with the column name quoted, otherwise eval will fail.
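As a minimal sketch of that, with toy frames (the column contents and the quoted formula string are assumptions):

```python
import pandas as pd

# Toy frames modeled on the question; the stored formula string must be
# valid Python, so the column name is quoted inside the brackets
p_input = pd.DataFrame({'Order No': [101, 102, 103]})
InputDF = pd.DataFrame({0: ['p_input["Order No"]']})

DF2 = pd.DataFrame()
# eval() turns the stored string into the actual column lookup
DF2['column1'] = eval(InputDF.loc[0, 0])
print(DF2)
```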
I try to select a subset of the object type column cells with str.split(pat="'")
dataset['pictures'].str.split(pat=",")
I want to get the numbers 40092 and 39097 and the two picture dates as two columns, ID and DATE, but instead I get one column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas DataFrame with one of the columns containing JSON strings (or any other string that needs to be parsed into multiple columns)
E.g.
df = pd.DataFrame({'pictures': [
'{"col1":"40092","picture_date":"2017-11-06"}',
'{"col1":"39097","picture_date":"2017-10-31"}']
})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them)
Define a function for parsing the row:
import json

def parse_row(r):
    j = json.loads(r['pictures'])
    return j['col1'], j['picture_date']
Then use the Pandas DataFrame.apply() method as follows:
df1 = df.apply(parse_row, axis=1, result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
0 1
0 40092 2017-11-06
1 39097 2017-10-31
If you need just one column you can return a single element from parse_row (instead of a two element tuple in the above example) and just use df.apply(parse_row).
If the values are not in json format, just modify parse_row accordingly (Split, convert string to numbers, etc.)
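If the strings are valid JSON, an alternative sketch parses them with json.loads and expands the dicts with pd.json_normalize in one step (the ID/DATE column names come from the question):

```python
import json
import pandas as pd

df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}',
]})

# Parse each JSON string into a dict, then expand the dicts into columns
parsed = pd.json_normalize(df['pictures'].map(json.loads).tolist())
parsed = parsed.rename(columns={'col1': 'ID', 'picture_date': 'DATE'})
print(parsed)
```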
Thanks for the replies but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist= dataset['pictures'].values.tolist()
Afterwards I created a dataframe from that list and concatenated it with the original dataset, minus the pictures column:
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)
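This works when the column actually holds dicts rather than strings; a minimal runnable sketch with toy data (the 'other' column is an assumption):

```python
import pandas as pd

# Toy dataset where 'pictures' holds dicts (if they were JSON strings,
# each value would need json.loads first)
dataset = pd.DataFrame({
    'other': ['a', 'b'],
    'pictures': [
        {'col1': '40092', 'picture_date': '2017-11-06'},
        {'col1': '39097', 'picture_date': '2017-10-31'},
    ],
})

picturelist = dataset['pictures'].values.tolist()
# A list of dicts expands into one column per key
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)
print(new_dataset)
```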
I have been trying to search a dataframe for a list of numbers. Every time a number matches in a column, I would like to return the whole row, save it to a new dataframe, and then export that to Excel.
millreflist is the list of numbers - can be of random length.
TUCABCP is the dataframe I am searching.
PO is the column I am searching in for the numbers.
I have tried the code below using .loc, but when opening the new excel file I am just getting the header and no rows or data.
millreflistlength = len(millreflist)
for i in range(millreflistlength):
    TUCABCP = TUCABCP.loc[TUCABCP['PO'] == millreflist[i]]
TUCABCP.to_excel("NEWBCP.xlsx", header=True, index=False)
I have used the following question for reference, but it does not cover when you would like to search with a list of numbers: Selecting rows from a Dataframe based on values in multiple columns in pandas
Try something like this:
## Get the indexes of the rows whose PO value appears in the list
indexes = TUCABCP[TUCABCP['PO'].isin(millreflist)].index
## Filter the original df to just the rows with those indexes
df = TUCABCP[TUCABCP.index.isin(indexes)]
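In fact the whole thing collapses into a single isin() filter; a minimal sketch with toy data (the PO values and the Qty column are assumptions):

```python
import pandas as pd

# Toy data modeled on the question
TUCABCP = pd.DataFrame({'PO': [100, 200, 300, 400], 'Qty': [5, 6, 7, 8]})
millreflist = [200, 400]

# isin() builds a boolean mask that is True wherever PO appears in the
# list, so one vectorised filter replaces the loop (which kept
# re-filtering the already-filtered frame and ended up empty)
result = TUCABCP[TUCABCP['PO'].isin(millreflist)]
print(result)
# result.to_excel("NEWBCP.xlsx", header=True, index=False)  # then export
```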
How to implement the Excel 'COUNTIF()' using python
Please see the image below for reference.
I have a column named 'Title' which contains some text (CD, PDF), and I need to find the count of each string in the column, as given below.
No.of CD : 4
No.of PDF: 1
In Excel I could find the same by using the formula below:
=COUNTIF($A$5:$A$9,"CD")
How can I do the same using Python?
For a simple summary of list item counts, try .value_counts() on a Pandas data frame column:
my_list = ['CD','CD','CD','PDF','CD']
df = pd.DataFrame({'my_column': my_list}) # create a data frame column from the list
df['my_column'].value_counts()
... or on a Pandas series:
pd.Series(my_list).value_counts()
Having a column of counts can be especially useful for scrutinizing issues in larger datasets. Use the .count() method to create a column with corresponding counts (similar to using COUNTIF() to populate a column in Excel):
df['countif'] = [my_list.count(i) for i in my_list] # count list item occurrences and assign to new column
display(df[['my_column','countif']]) # view results
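Putting both ideas together, a minimal runnable sketch (the list values are taken from the example above):

```python
import pandas as pd

my_list = ['CD', 'CD', 'CD', 'PDF', 'CD']
df = pd.DataFrame({'my_column': my_list})

# Summary of counts per distinct value
counts = df['my_column'].value_counts()
print(counts)

# COUNTIF-style column: each row gets the count of its own value
df['countif'] = [my_list.count(i) for i in my_list]
print(df)
```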
I guess you can use map to compare with "CD" and then sum all the values.
Example:
Create "title" data:
df = pd.DataFrame({"Title":["CD","CD","CD","PDF","CD"]})
The COUNTIF equivalent using map, then sum:
df["Title"].map(lambda x: int(x=="CD")).sum()
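The same count can also be obtained by summing a boolean comparison directly; a small sketch for comparison (data taken from the example above):

```python
import pandas as pd

df = pd.DataFrame({"Title": ["CD", "CD", "CD", "PDF", "CD"]})

# map each value to 1/0 and sum, as in the answer
n_cd = df["Title"].map(lambda x: int(x == "CD")).sum()

# equivalent: a boolean Series sums its True values
n_cd2 = (df["Title"] == "CD").sum()
print(n_cd, n_cd2)
```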