I have a list of 4 or 5 data frames, each with a blank column 'id', and I want to generate a cumulative sequence across these data frames.
I tried something like this, but it's not working.
f is my list of data frames
for i in range(1,len(f)):
    print(f[1]['id'])
    for row in f[i]['id']:
        f[i] = f[i].assign(id=numpy.arange(1, len(f) + 1))
I want output data frames like:

f[0]:
id  table
1   abc
2   def

f[1]:
id  table
3   abc
4   def

f[2]:
id  table
5   abc
6   def

and so on.
Please help, I'm new to Python.
A simple loop approach could be:
dfs = [df1, df2]

start = 1  # the desired output is 1-based
for d in dfs:
    stop = start + len(d)
    d['id'] = range(start, stop)
    start = stop
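A quick check on two toy frames shaped like the question's (hypothetical data; df1 and df2 stand in for the real frames):

import pandas as pd

df1 = pd.DataFrame({'table': ['abc', 'def']})
df2 = pd.DataFrame({'table': ['abc', 'def']})

dfs = [df1, df2]
start = 1
for d in dfs:
    stop = start + len(d)
    d['id'] = range(start, stop)
    start = stop

print(df1['id'].tolist())  # [1, 2]
print(df2['id'].tolist())  # [3, 4]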
I have many different tables that all have different column names, each referring to an outcome like glucose, insulin, leptin, etc. (keep in mind that the real tables are gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty and then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example - ignore that the function makes little sense. The code below works, but instead of copy-pasting final_report["outcome"] = ... over and over again, I would like to run the find_result function over each of glucose, insulin, and leptin and add "glucose_result", "insulin_result", and "leptin_result" to final_report in one or a few lines.
Thanks in advance.
import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})

ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})

def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) &
                (df["timepoint"] >= start) &
                (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()

final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}

for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1,
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
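If you prefer not to mutate final_report column by column, a sketch of an alternative (assuming the same find_result function and input_dfs dict as above) builds all three columns in a single assign call:

final_report = final_report.assign(**{
    f"{name}_result": final_report.apply(
        # df=df pins the current frame to the lambda; apply runs eagerly
        # inside the comprehension, so this is just a defensive habit
        lambda x, df=df: find_result(x['id'], x['start'], x['end'], df),
        axis=1,
    )
    for name, df in input_dfs.items()
})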
I have 37 data frames of different shapes that I need to concatenate. The following is what I tried:
path = '/Users/data_frames/data'
all_files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('pc.tsv')]

main = []
for files in all_files:
    dfs = pd.read_csv(files, sep="\t", index_col=0)
    dfs.reset_index(drop=True, inplace=True)
    main.append(dfs)

merged_pr_matrix = pd.concat(main, axis=1)
The above script runs thanks to this line:
dfs.reset_index(drop=True, inplace=True)
However, it drops the original index values (row names), which I would like to keep. For example, this is the final matrix I now have after concatenating:
               ABV            TCG          FGH          HKL              MK           MYT           JUJN           MTPTA
0        5130132,5   22778,703125  675790,6875    4846942,5  106934,4453125    2884897,25        2777415         3487836
1        3478507,5     898987,375    2825588,5    5006338,5   119250,765625     4393944,5     3111324,25      2594582,75
2  18402,615234375  56879,6484375  524456,3125  323671,4063     166333,4375  78539,921875    233480,0625  35772,69140625
3        2310551,5    587836,1875     241836,5      5488325    29411,296875  517361,46875  190795,078125    67885,640625
4     95646,140625        1106308      1356453     17681780     592893,9375       1857957        1224196      1417179,25
In the original inputs I had values in the index which I would like to keep.
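A minimal sketch of a fix, assuming the row names live in the first column of each .tsv: read them into the index and skip the reset_index call, so concat can align the frames on those names:

import os
import pandas as pd

path = '/Users/data_frames/data'
all_files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('pc.tsv')]

main = []
for f in all_files:
    # index_col=0 puts the row names into the index; leaving it intact
    # lets concat keep them in the merged matrix
    df = pd.read_csv(f, sep='\t', index_col=0)
    main.append(df)

merged_pr_matrix = pd.concat(main, axis=1)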
I have a data frame with 34 rows and 10 columns, which I called "comp". I then computed "invcomp = 1/comp", so the values changed but the column names stayed the same. I want to rename the columns of "invcomp" by appending an extra term at the end of each name: for example, the first column, "Cbm_m" in "comp", should become "Cbm_m_inv" in "invcomp".
Use 'add_suffix':
invcomp = invcomp.add_suffix('_inv')
Setup:
import numpy as np

invcomp = pd.DataFrame(np.random.rand(5, 5), columns=list('ABCDE'))
invcomp = invcomp.add_suffix('_inv')
Output:
A_inv B_inv C_inv D_inv E_inv
0 0.111604 0.016181 0.384071 0.608118 0.944439
1 0.523085 0.139200 0.495815 0.007926 0.183498
2 0.090169 0.357117 0.381938 0.222261 0.788706
3 0.802219 0.002049 0.173157 0.716345 0.182829
4 0.260781 0.376730 0.646595 0.324361 0.345097
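If only a single column should change rather than all of them, rename is an alternative (using the column name from the question):

invcomp = invcomp.rename(columns={'Cbm_m': 'Cbm_m_inv'})

Columns not listed in the mapping are left untouched.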
I have a Pandas DataFrame built from multiple .fits files, each containing multiple columns with individual labels. I'd like to extract one column and create variables that hold the first and last rows of that column, but I'm having a hard time doing this for the individual .fits files rather than the entire DataFrame. Any help would be appreciated! :)
Here is how I read in my files:
path = '/Users/myname/folder/'
m = [os.path.join(dirpath, f)
     for dirpath, dirnames, files in os.walk(path)
     for f in fnmatch.filter(files, '*.fits')]
^^^ This recursively searches through my directory containing multiple .fits files in many subfolders.
dataframes = []
for ii in range(0, len(m)):
    data = pd.read_csv(m[ii], header='infer', delimiter='\t')
    d = pd.DataFrame(data)
    top = d['desired_column'].head()
    bottom = d['desired_column'].tail()
    First_and_Last = pd.concat([top, bottom])
I tried using the .head and .tail commands for Pandas DataFrames, but I am unsure how to use them properly for what I want. With the way I read in my .fits files, the code above gives me the very first few rows and the very last few rows (5 each, the default for head and tail), as seen here:
0 2.456849e+06
1 2.456849e+06
2 2.456849e+06
3 2.456849e+06
4 2.456849e+06
1118 2.456852e+06
1119 2.456852e+06
1120 2.456852e+06
1121 2.456852e+06
1122 2.456852e+06
What I want is the first and last row of the specific column for each .fits file, not just for the DataFrame containing all the files. With the way I am reading in my .fits files, the DataFrame seems to sort of concatenate all the files together. Any tips on how I can accomplish this?
If you want only the first row:
top = d['desired_column'].head(1)
If you want only the last row:
bottom = d['desired_column'].tail(1)
I don't quite see the problem of "Dataframe seems to sort of concatenate all the files together." Would you please clarify the question?
By the way, after data = pd.read_csv(m[ii], header='infer', delimiter='\t'), data is already a DataFrame, so d = pd.DataFrame(data) is unnecessary.
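If the goal is one first/last pair per file, a minimal sketch (reusing the file list m and the tab-delimited read from the question; 'desired_column' is the question's placeholder name):

import pandas as pd

rows = []
for fname in m:  # m: the list of .fits file paths built earlier
    d = pd.read_csv(fname, header='infer', delimiter='\t')
    col = d['desired_column']
    rows.append({'file': fname, 'first': col.iloc[0], 'last': col.iloc[-1]})

summary = pd.DataFrame(rows)  # one row per input file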
The .iloc indexer easily pulls the top and bottom rows; df["col_1"] below represents the column of interest:
In [28]: import pandas as pd
In [29]: import numpy as np
In [30]: np.random.seed(42)
In [31]: df = pd.DataFrame(np.random.randn(6,3), columns=["col_1", "col_2", "col_3"])
In [32]: df
Out[32]:
col_1 col_2 col_3
0 0.496714 -0.138264 0.647689
1 1.523030 -0.234153 -0.234137
2 1.579213 0.767435 -0.469474
3 0.542560 -0.463418 -0.465730
4 0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831 0.314247
In [33]: pd.Series([df["col_1"].iloc[0], df["col_1"].iloc[-1]]) # pd.Series([top, bottom]) ; or pd.DataFrame([top, bottom]), if data frame needed.
Out[33]:
0 0.496714
1 -0.562288
dtype: float64