I am relatively new to Python, so please excuse any confusion that may arise from my bad terminology.
Anyway, I am currently stuck trying to obtain the first value for each index of level 2 of a multi-indexed dataframe. The df has two index levels, level 1 being 'user' and level 2 being 'trial'. Both 'user' and 'trial' are integer values, while 't' holds continuous float values.
Basically, I want to extract the first 't' value of the dataframe in question, df, for each trial of each user.
I have used df['user'].unique() and df['trial'].unique() (before doing df.set_index(['user', 'trial'])) and discovered that there are 1040 unique users and 97 unique trials. The main problem is that not every user has the same trial numbers (i.e. user 1 has a trial number 5, while user 2 does not, and so on).
Is there any way to obtain these values and later compile them into a similar dataframe, df2, which is also indexed by 'user' and 'trial'?
Thanks in advance!
Use DataFrame.drop_duplicates:
df = df.reset_index()  # move 'user' and 'trial' back into regular columns
df = df.drop_duplicates(subset=['user', 'trial'], keep='first')  # keep the first row per pair
df = df.set_index(['user', 'trial'])  # restore the MultiIndex
(If the rows are not already ordered by 't', sort on that column first, e.g. df.sort_values('t'), so that keep='first' picks up the earliest value.)
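An alternative sketch, assuming the MultiIndex is already set and the rows are in time order within each trial: group on the two index levels and take the first row of each group.

import pandas as pd

# Hypothetical data shaped like the question's: a ('user', 'trial') MultiIndex
# with a continuous float column 't'; the values here are made up.
df = pd.DataFrame({'user':  [1, 1, 1, 2, 2],
                   'trial': [5, 5, 6, 7, 7],
                   't':     [0.1, 0.4, 0.2, 0.3, 0.9]}).set_index(['user', 'trial'])

# First 't' per (user, trial) pair; users with differing trial numbers are fine,
# since groupby only forms groups for pairs that actually occur.
df2 = df.groupby(level=['user', 'trial']).first()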
I'm new to the world of Python, so I apologize in advance if this question seems pretty rudimentary. I'm trying to pull columns of one dataframe into a separate dataframe. I want to replace the duplicate columns from the first dataframe with a single column containing their mean values in the second dataframe. I hope this makes sense!
To provide some background, I am tracking gene expression over certain time points. I have a dataframe that is 17 rows x 33 columns. Every row in this dataframe corresponds to a particular exon, and every column corresponds to a time point (AGE).
Some of these columns share the same name (age), and I'd like to calculate the mean of ONLY the columns with the same name, so that, for example, I get one column for "12 pcw" rather than three separate "12 pcw" columns. Afterwards I hope to pull these averaged values from the first dataframe into a second dataframe.
I'm hoping to use a for loop to loop through each age (column) to get the average expression across the subjects.
I will explain my process so far below:
#1) Get list of UNIQUE string names from age list
unique_ages = set(column_names)
#2) Create an empty dataframe that gives an outline of what I want my averaged data to fit/be put in
mean_df = pd.DataFrame(index=exons, columns=unique_ages)
#3) Now I want to loop through each age to get the average expression across the donors present. This is where I'm trying to utilize a for loop to create a pipeline to process other data frames that I will be working with in the future.
for age in unique_ages:
    print(age)
    age_df = df.loc[:, df.columns == age]  ## pull the columns of df with this name as a separate df
    if len(age_df.columns) > 1:  ## check if >1 column shares this name; if so, take the avg across them
        mean_df[age] = age_df.mean(axis=1)
    else:
        mean_df[age] = age_df.iloc[:, 0]  ## just pull out the values and put them into mean_df
#4) Now, with my new averaged array (or the same array if multiple ages are NOT present), I want to place this array into my temp df under the appropriate column. I understand that I should use the 'age' variable provided by the for loop to pick the proper column name in my temp df, which is what I've attempted above, but I'm not sure it's right. This has all been quite a steep learning curve, and I feel like it's a simple solution that I can't seem to wrap my head around. Any help would be greatly appreciated.
There is no need for a for loop (there often isn't with Pandas :)). You can simply use df.groupby(lambda x:x, axis=1).mean(). An example:
import pandas as pd

data = [[1,2,3],[4,5,6]]
cols = ['col1', 'col2', 'col2']
df = pd.DataFrame(data=data, columns=cols)
#    col1  col2  col2
# 0     1     2     3
# 1     4     5     6
df = df.groupby(lambda x:x, axis=1).mean()
#    col1  col2
# 0   1.0   2.5
# 1   4.0   5.5
The groupby call takes another function (the lambda): pandas passes in each column name, and the function returns the group that column belongs to. In our case, we want the column name itself to be the group. So, for the third column, named col2, it will say 'this column belongs to the group named col2', which already exists (because the second column was passed earlier). You then provide the aggregation you want, in this case mean().
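One caveat: recent pandas releases deprecate the axis=1 argument to groupby, so if the snippet above warns or fails, an equivalent form (same data and cols as above) transposes, groups the rows by index label, and transposes back:

df = pd.DataFrame(data=data, columns=cols)
df = df.T.groupby(level=0).mean().T
#    col1  col2
# 0   1.0   2.5
# 1   4.0   5.5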
I have a large dataset with multiple columns across different months. I have two identifiers I use which are License and Location. This is a sample of what my data looks like:
https://i.stack.imgur.com/aU8JU.png
I am in the midst of migrating my data, and one of my sheets requires all the columns except for the "Type" column. However, when I migrate over, I end up with duplicate rows, since there are repeated licenses and locations. I want to sum up these repeated licenses and locations for every month. This is my desired output:
https://i.stack.imgur.com/WwIz2.png
My migration code so far is:
def migrate(df, template):
    inventory = df.copy()
    inventory = inventory[['License', 'Location', 'Date', 'Quantity']]
What other scripts can I write to achieve what I want?
First, create a GroupBy object using a list of the column names whose unique combinations you want to keep (using as_index=False to retain them as columns instead of transforming them into an index). This object essentially stores the values in the remaining columns for each unique combination of the given ones. You can then aggregate these remaining values using a desired aggregation function; in your case, sum.
df_grouped = df.groupby(["License", "Location", "Date"], as_index=False).agg(sum)
Here is the output for the first four rows of your input dataframe:
   License Location    Date        Type  Quantity
0      123       aa  1/1/16  abcdebcdef         4
1      456       bb  1/1/16       fffff         3
2      789       cc  1/1/16       ggggg         4
As you can see, all values in the "Type" and "Quantity" columns for matching "License", "Location" and "Date" entries have been combined using the summation function. For "Type", that means that the strings have been concatenated; I've just included that here as a consistency check. You can drop that column once you've verified that it has worked as planned.
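For reference, a hedged sketch of how this could slot into the migrate function from the question (I'm assuming df carries the columns shown in the screenshots; template is left untouched, as in the original):

import pandas as pd

def migrate(df, template):
    inventory = df.copy()
    # Keep every column except 'Type', as required by the destination sheet.
    inventory = inventory[['License', 'Location', 'Date', 'Quantity']]
    # Collapse repeated License/Location rows within each date by summing quantities.
    inventory = inventory.groupby(['License', 'Location', 'Date'], as_index=False).agg('sum')
    return inventory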
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR; the two dataframes are only similar (not equal) in this ID column, while the rest of the columns are different, as is the number of rows.
How can I get the rows of MR whose MR['ID'] values are also found in DT['ID'], knowing that values in 'ID' can appear several times in the same column?
(DT is 1538 rows and MR is 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
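A minimal sketch with made-up frames, just to show the shape of both results:

import pandas as pd

# Illustrative data only; 'ID' can repeat, as in the question.
MR = pd.DataFrame({'ID': [1, 2, 2, 3], 'mr_col': ['w', 'x', 'y', 'z']})
DT = pd.DataFrame({'ID': [2, 4], 'dt_col': ['a', 'b']})

matching_id = MR.ID.isin(DT.ID)       # [False, True, True, False]
filtered = MR.loc[matching_id, :]     # the two ID == 2 rows of MR, columns unchanged
combined = pd.merge(MR, DT, on='ID')  # same rows, with DT's columns joined on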
I'm stuck on a particular Python question here. I have 2 dataframes, DF1 and DF2. In both, I have 2 columns, pID and yID (which are not indexed, just default). I'm looking to add a column Found to DF1 indicating where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to zero in on just the values in DF2 where aID == 'Text'.
I believe the code below gets me the first part of this question; however, I'm unsure how to incorporate the where condition.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe to the rows containing aID == 'Text' to get a reduced DF, from which you select the portions of columns to be compared against the first dataframe.
Use DF.isin() to check whether the values present under these column names match. Then .all(axis=1) returns True for a row only if both columns are True, else False. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DF's used:
df1 = pd.DataFrame(dict(pID=[1,2,3,4,5],
                        yID=[10,20,30,40,50]))
df2 = pd.DataFrame(dict(pID=[1,2,8,4,5],
                        yID=[10,12,30,40,50],
                        aID=['Text','Best','Text','Best','Text']))
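For a quick sanity check with these frames (my own run-through, not part of the original answer):

df1_sub.isin(df2_sub).all(axis=1).astype(int).tolist()
# [1, 0, 0, 0, 1] -- rows 0 and 4 match df2's 'Text' rows on both pID and yID.
# Note that DataFrame.isin(DataFrame) aligns on index and column labels, so
# only rows whose index label also appears in df2_sub can come out True.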
If it does not matter where those matches occur, then merge the two dataframes on the common 'pID', 'yID' columns as the key, taking the bigger DF's index (right_index=True) as the new index axis to be emitted and aligned once the merge operation is over.
Access these indices, which indicate where matches were found, and assign the value 1 to a new column named Found, filling its missing elements with 0's throughout.
df1.loc[pd.merge(df1_sub, df2_sub, on=['pID', 'yID'], right_index=True).index, 'Found'] = 1
df1['Found'].fillna(0, inplace=True)
df1 will be modified accordingly after the above steps.
So I imported and merged 4 csv's into one dataframe called data. However, upon inspecting the dataframe's index with:
index_series = pd.Series(data.index.values)
index_series.value_counts()
I see that multiple index entries have 4 counts. I want to completely reindex the data dataframe so each row now has a unique index value. I tried:
data.reindex(np.arange(len(data)))
which gave the error "ValueError: cannot reindex from a duplicate axis". A Google search leads me to think this error occurs because there are up to 4 rows that share the same index value. Any idea how I can do this reindexing without dropping any rows? I don't particularly care about the order of the rows either, as I can always sort them later.
UPDATE:
So in the end I did find a way to reindex like I wanted.
data['index'] = np.arange(len(data))
data = data.set_index('index')
As I understand it, I just added a new column called 'index' to my data frame, and then set that column as my index.
As for my csv's, they were the four csv's under "download loan data" on this page of Lending Club loan stats.
It's pretty easy to replicate your error with this sample data:
In [91]: import numpy as np, pandas as pd

In [92]: data = pd.DataFrame( [33,55,88,22], columns=['x'], index=[0,0,1,2] )
In [93]: data.index.is_unique
Out[93]: False
In [94]: data.reindex(np.arange(len(data)))  # same error message
The problem is that reindex requires unique index values. In this case, you don't want to preserve the old index values; you merely want new index values that are unique. The easiest way to do that is:
In [95]: data.reset_index(drop=True)
Out[95]:
    x
0  33
1  55
2  88
3  22
Note that you can leave off drop=True if you want to retain the old index values as a column.
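As an aside, and assuming the four CSVs were combined with pd.concat, passing ignore_index=True at concat time produces a unique RangeIndex up front, so the duplicate axis never appears:

import pandas as pd

# Two tiny frames standing in for the four CSVs (made-up values).
a = pd.DataFrame({'x': [33, 55]})  # index 0, 1
b = pd.DataFrame({'x': [88, 22]})  # index 0, 1 again -> duplicates after a plain concat

data = pd.concat([a, b], ignore_index=True)  # fresh unique index 0..3
data.index.is_unique  # True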