Restructuring Pandas Dataframe for large number of columns - python

I have a pandas dataframe which is a large number of answers given by users in response to a survey and I need to re-structure it. There are up to 105 questions asked each year, but I only need maybe 20 of them.
The current structure is as below.
What I want to do is re-structure it so that the row values become column names and the answer given by the user is then the value in that column. In a picture (from Excel), what I want is the below (I know I'll need to re-name my columns, but that's fine once I can create the structure in the first place):
Is it possible to re-structure my dataframe this way? The outcome of this is to use some predictive analytics to predict a target variable, so I need to re-strcture before I can use Random Forest, kNN, and so on.

You might want try pivoting your table:
df.pivot(index=['SurveyID', 'UserID'], columns=['QuestionID'], values=['AnswerText'])
df.columns = [x[0] if x[1] == "" else "Answer_{}".format(x[1]) for x in df.columns.to_flat_index()]

Related

Pandas How do I create a list of duplicates from one column, and only keep the highest value for the corresponding columns?

I want to find all of the duplicates in the first column Primary Mod Site and only keep the highest value for all of the compounds (columns B-M) in the dataset.
excel sheet
For code, I have:
#read desired excel file
df = pd.read_excel("20220825_CISLIB01_Plate-1_Rows-A-B")
#function to find the duplicates in the dataset, sections them, and remove them
#can be applied to any dataset with the same format as original excel files
def getDuplicate():
gene = df["Primary Mod Site"]
#creates a list of all of the duplicates in Primary Mod Site
pd.concat(g for _, g in df.groupby("gene") if len(g) > 1)
Im stuck on what to do next. Help much appreciated!
it helps if you post the data as code or text, to allow to reproduce.
but, IIUC, you need to groupby the column 'A' and then take the max from rest of the columns, this seems to do the trick
df["Primary Mod Site"].max()
Based on what i noticed in the screenshot (3 first rows for example), the row with the highest values tends to have the highest value in all columns, sooo, something like this might work.
df = df.sort_values("ONCV-1-1-1", ascending = False).drop_duplicates("Primary Mod Site", keep='first', ignore_index=True)
or if not sure if that observation is correct for all rows.
probably this would work
df = df.groupby("Primary Mod Site").max()
NB: please post a reproducible example, easy to copy paste for us to test.

How should I select rows of a pandas dataframe whose entries start with a certain string?

Apologies if this is contained in a previous answer but I've read this one: How to select rows from a DataFrame based on column values? and can't work out how to do what I need to do:
Suppose have some pandas dataframe X and one of the columns is 'timestamp'. The entries are formatted like '2010-11-03 09:44:05'. I want to select just those rows that correspond to a specific day, for example, select just those rows for which the actual string in timestamp column starts with '2010-11-03'. Is there a neat way to do this? Can I do it with a mask or Boolean indexing? Or should I just write a separate line to peel off the day from each entry and then select the rows? Bear in mind the dataframe is large if it helps.
i.e. I want to write something like
X.loc[X['timestamp'].startswith('2010-11-03')]
or
mask = '2010-11-03' in X["timestamp"]
but these don't actually make any sense.
This should work:-
X[X['timestamp'].str.startswith('2010-11-03')]

How to update/apply validation to pandas columns

I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, dependent on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
print(x.zfill(13)[:13])
I would like to know the most optimal way to apply that format to the existing values and only if those values are present (i.e. not touching it if there are null values).
Also, I have a need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY or sometimes MM/DD/YY, etc.. and any of those are fine but what is not fine is if the actual value in the date column is an Excel serial number that Excel can fomat as a date. Is there some way to apply validation logic to an entire dataframe column the ensure that as there is a valid date instead of serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there you could try try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
for your STOCK_NUM question, you could potentially apply a function to the column but the way I approach this is using list comprehensions. The first thing I would do is replace all the NAs in your STOCK_NUM column by a unique string and then apply the list comprehension as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df.STOCK_NUM.fillna('IS_NA',inplace=True)
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then for your question relating to converting excel serial number to a date, I looked at an already answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
temp = datetime.datetime(1900, 1, 1)
delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
return pd.to_datetime(temp+delta)
df['Date'] = [xldate_to_datetime(i) if type(i)==int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.

Cryptocurrency correlation in python, working with dictionaries

I'm working with a crypto-currency data sample, each cell contains a dictionary. The dictionary containing the open price, close price, highest price, lowest price, volume and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.
I don't know how to prepare the data in order for me to find the correlation between different currencies and between highest price and volume for example. How can this be done in python (pandas)...also how would I define a date range in such a situation?
Here's a link to the data sample, my coding and a printout of the data (Access is OPEN TO PUBLIC): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I
To begin with, I would suggest rearranging your data so that each currency's OHLCV values are their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest beginning with only one metric (e.g. close price) and perhaps period movement (e.g. close-open) in your analysis. To answer your question:
Pandas can return a correlation matrix of all columns with:
df.corr()
If you want to use only specific columns, select those from the DataFrame:
df[["col1", "col2"]].corr()
You can return a single correlation value between two columns with the form:
df["col1"].corr(df["col2"])
If you'd like to specify a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of the type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps begin with pandas.to_datetime).
In future, I would suggest including a data snippet in your post. I don't believe Google Drive is an appropriate form to share data, and it definitely is not appropriate to set the data to "request access".
EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may find problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10].
Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.
df = df.T
Following this, we concatenate to a new DataFrame (result). Alternatively, concatenate to the original and drop columns after. Unfortunately I can't think of a non-iterative method. This method goes column by column, creates a DataFrame for each coins, adds the coin name prefix to the column names, then concatenates each DataFrame to the end.
result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
coin_data = df[coin]
split_coin = coin_data.apply(pd.Series).add_prefix(coin+"_")
result = pd.concat([result, split_coin], axis=1)

Create Loop to dynamically select rows from dataframe, then append selected rows to another dataframe: df.query()

I am currently working with dataframes in pandas. In sum, I have a dataframe called "Claims" filled with customer claims data, and I want to parse all the rows in the dataframe based on the unique values found in the field 'Part ID.' I would then like to take each set of rows and append it one at a time to an empty dataframe called "emptydf." This dataframe has the same column headings as the "Claims" dataframe. Since the values in the 'Part ID' column change from week to week, I would like to find some way to do this dynamically, rather than comb through the dataframe each week manually. I was thinking of somehow incorporating the df.where() expression and a For Loop, but am at a loss as to how to put it all together. Any insight into how to go about this, or even some better methods, would be great! The code I have thus far is divided into two steps as follows:
emptydf = Claims[0:0]
#Create empty dataframe
2.Parse_Claims = Claims.query('Part_ID == 1009')
emptydf = emptydf.append(Parse_Claims)
#Parse the dataframe by each unique Part ID number and append to empty dataframe. As you can see, I can only hard code one Part ID number at a time so far. This would take hours to complete manually, so I would love to figure out a way to iterate through the Part ID column and append the data dynamically.
Needless to say, I am super new to Python, so I definitely appreciate your patience in advance!
empty_df = list(Claims.groupby(Claims['Part_ID']))
this will create a list of tuples one for each part id. each tuple has 2 elements 1st is part id and 2nd is subset for that part id

Categories