overwriting dataframes in pandas - python

I have a given dataframe
new_df :
ID
summary
text_len
1
xxx
45
2
aaa
34
I am performing some df manipulation by concatenating keywords from different df, like that:
keywords = df["keyword"].to_list()
for key in keywords:
new_df[key] = new_df["summary"].str.lower().str.count(key)
new_df
from here I need two separate dataframes to perform few actions (to each of them add some columns, do some calculations etc).
I need a dataframe with occurrences as per given piece of code and a binary dataframe.
WHAT I DID:
assign dataframe for occurrences:
df_freq = new_df (because it is already calculated an done)
I created another dataframe - binary one - on the top of new_df:
#select only numeric columns to change them to binary
numeric_cols = new_df.select_dtypes("number", exclude='float64').columns.tolist()
new_df_binary = new_df
new_df_binary['text_length'] = new_df_binary['text_length'].astype(int)
new_df_binary[numeric_cols] = (new_df_binary[numeric_cols] > 0).astype(int)
Everything works fine - I perform the math I need, but when I want to come back to df_freq - it is no longer dataframe with occurrences.. looks like it changed along with binary code
I need separate tables and perform separate math on them. Do you know how I can avoid this hmm overwriting issue?

You may use pandas' copy method with the deep argument set to True:
df_freq = new_df.copy(deep=True)
Setting deep=True (which is the default parameter) ensures that modifications to the data or indices of the copy do not impact the original dataframe.

Related

Pulling columns of dataframe into separate dataframe, then replacing duplicates with mean values

I'm new to the world of python so I apologize in advance if this question seems pretty rudimentary. I'm trying to pull columns of one dataframe into a separate dataframe. I want to replace the duplicate columns from the first dataframe with one column that contains the mean values into the second dataframe. I hope this makes sense!
To provide some background, I am tracking gene expression over certain time points. I have a dataframe that is 17 rows x 33 columns. Every row in this data frame corresponds to a particular exon. Every column on this data frame corresponds to a time-point (AGE).
The dataframe looks like this:
Some of these columns contain the same name (age) and I'd like to calculate the mean of ONLY the columns with the same name, so that, for example, I get one column for "12 pcw" rather than three separate columns for "12 pcw." After which I hope to pull these values from the first dataframe into a second dataframe for averaged values.
I'm hoping to use a for loop to loop through each age (column) to get the average expression across the subjects.
I will explain my process so far below:
#1) Get list of UNIQUE string names from age list
unique_ages = set(column_names)
#2) Create an empty dataframe that gives an outline of what I want my averaged data to fit/be put in
mean_df = pd.DataFrame(index=exons, columns=unique_ages)
#3) Now I want to loop through each age to get the average expression across the donors present. This is where I'm trying to utilize a for loop to create a pipeline to process other data frames that I will be working with in the future.
for age in unique_ages:
print(age)
age_df = pd.DataFrame() ##pull columns of df as separate df that have this string
if len(age_df.columns) > 1: ##check if df has >1 SAME column, if so, take avg across SAME columns
mean = df.mean(axis=1)
mean_df[age] = mean
else:
## just pull out the values and put them into your temp_df
#4) Now, with my new averaged array (or same array if multiple ages NOT present), I want to place this array into my 'temp_df' under the appropriate columns. I understand that I should use the 'age' variable provided by the for loop to get the proper locationname of the column in my temp df. However I'm not sure how to do this. This has all been quite a steep learning curve and I feel like it's a simple solution but I can't seem to wrap my head around it. Any help would be greatly appreciated.
There is no need for a for loop (there often isn't with Pandas :)). You can simply use df.groupby(lambda x:x, axis=1).mean(). An example:
data = [[1,2,3],[4,5,6]]
cols = ['col1', 'col2', 'col2']
df = pd.DataFrame(data=data, columns=cols)
# col1 col2 col2
# 0 1 2 3
# 1 4 5 6
df = df.groupby(lambda x:x, axis=1).mean()
# col1 col2
# 0 1.0 2.5
# 1 4.0 5.5
The groupby function takes another function (the lambda) which basically means that it will insert each column name, and that it will return the group that column belongs to. In our case, we just want the column name itself to be the group. So, on the third column named col2, it will say 'this column belongs to group named col2' which already exists (because the second column was passed earlier). You then provide the aggregation you want, in this case the mean().

pandas: return mutated column into original dataframe

Ive attempted to search the forum for this question, but, I believe I may not be asking it correctly. So here it goes.
I have a large data set with many columns. Originally, I needed to sum all columns for each row by multiple groups based on a name pattern of variables. I was able to do so via:
cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id','group'],as_index=False)[cols].sum().assign(sum = lambda x: x.sum(axis=1))
By running this code, I receive a modified dataframe grouped by my 2 factor variables (group & id), with all the columns, and the final sum column I need. However, now, I want to return the final sum column back into the original dataframe. The above code returns the entire modified dataframe into my sum column. I know this is achievable in R by simply adding a .$sum at the end of a piped code. Any ideas on how to get this in pandas?
My hopeful output is just a the addition of the final "sum" variable from the above lines of code into my original dataframe.
Edit: To clarify, the code above returns this entire dataframe:
All I want returned is the column in yellow
is this what you need?
data['sum'] = data.groupby(['id','group'])[cols].transform('sum').sum(axis = 1)

Split list into a row of columns in Pandas

I have an image that I can read in via imageio.imread(). However, for simplicity's sake, let's say I have a list of [1,2,3,4].
I'd like to transform that list into a dataframe of columns, represented as
pixel_1 pixel_2 pixel_3 pixel_4
0 1 2 3 4
(The column names are not of vital importance; I just want to spread the data across columns.)
I've tried various apply and map methods but I'm just not sure how to read an image via imageio.imread() into a column spread like this.
Hi Check if below lines are helpful
import pandas as pd
# Create Blank Dataframe
df = pd.DataFrame()
input_list = [1,2,3,4]
header_value ='Pixcel'
for item in input_list:
df[header_value+"_"+str(item)]=[item]
# Out of loop
print(df)

Difference between "as_index = False", and "reset_index()" in pandas groupby

I just wanted to know what is the difference in the function performed by these 2.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
as_index=False :
df_group1 = df.groupby("ID").sum().reset_index()
reset_index() :
df_group2 = df.groupby("ID", as_index=False).sum()
Both of them give the exact same output.
ID value
0 A 18
1 B 6
2 C 6
Can anyone tell me what is the difference and any example illustrating the same?
When you use as_index=False, you indicate to groupby() that you don't want to set the column ID as the index (duh!). When both implementation yield the same results, use as_index=False because it will save you some typing and an unnecessary pandas operation ;)
However, sometimes, you want to apply more complicated operations on your groups. In those occasions, you might find out that one is more suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True allows you to apply a sum over axis=1 without specifying the names of the columns, then summing the value over axis 0. When the operation is finished, you can use reset_index(drop=True/False) to get the dataframe under the right form.
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allow you to check the condition on a common column and not on an index, which is often way easier.
At some point, you might come across KeyError when applying operations on groups. In that case, it is often because you are trying to use a column in your aggregate function that is currently an index of your GroupBy object.

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the datafarme indicating if the desired value is present
for col in df.columns:
results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
for x in row.values))
print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be little daunting when first approached, even for someone who knows python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified so I will assume, that you can get it into a frame, or if you can not get it into a frame, will ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
for x in row.values))
print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question, with the pandas keyword, they likely did not notice this question. But that allows me to take my cut at answering before they notice. The pandas keyword is very well covered for well formed questions, so I am pretty sure that if this answer is not optimum, someone else will come by and improve it. So in the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you were new python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in python even more powerful, so I highly recommend them.

Categories