Parsing and creating a new df with conditions - python

I need some help with Python and pandas.
I have a dataframe in which the column seq1_id holds all the sequence IDs of species 1, and the column seq2_id holds the sequence IDs of species 2.
I applied a filter to those sequences and got two more dataframes: one with all the species-1 sequences that passed the filter, and one with all the species-2 sequences that passed.
So I have three dataframes in total.
Because within a pair one sequence can pass the filter while the other does not, it is important to keep only the gene pairs where both members survived the two filtering steps. So what I need to do is parse my first df, such as this one:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
and check row by row whether (for example, in the first row) seq1_A is present in df2 and seq8_B is present in df3; if so, keep that row of df1 and add it to a new df4.
Here is an example with output wanted:
first df:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
df2 (sp1) (seq3_A is absent)
Seq_1.id
seq1_A
seq2_A
seq4_A
df3 (sp2) (Seq11_B is absent)
Seq_2.id
seq8_B
Seq9_B
Seq10_B
Then because Seq11_B and seq3_A are not present, the df4 (output) would be:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
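For reference, the toy frames above can be built like this (a minimal sketch; names taken from the example, only pandas assumed):
import pandas as pd

df1 = pd.DataFrame({'Seq_1.id': ['seq1_A', 'seq2_A', 'seq3_A', 'seq4_A'],
                    'Seq_2.id': ['seq8_B', 'Seq9_B', 'Seq10_B', 'Seq11_B']})
df2 = pd.DataFrame({'Seq_1.id': ['seq1_A', 'seq2_A', 'seq4_A']})
df3 = pd.DataFrame({'Seq_2.id': ['seq8_B', 'Seq9_B', 'Seq10_B']})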
Here is what I tried on the real data:
candidates_0035 = pd.read_csv("candidates_genes_filtering_0035", sep='\t')
candidates_0042 = pd.read_csv("candidates_genes_filtering_0042", sep='\t')
dN_dS = pd.read_csv("dn_ds.out_sorted", sep='\t')
df4 = dN_dS[dN_dS['seq1_id'].isin(candidates_0042['gene']) & dN_dS['seq2_id'].isin(candidates_0035['gene'])]
But I got an empty output, with only the column names, and it should not be like that.
Here are the data if you want to test the code on them:
df1:
Unnamed: 0 seq1_id seq2_id dN dS Dist_third_pos Dist_brute Length_seq_1 Length_seq_2 GC_content_seq1 GC_content_seq2 GC Mean_length
0 0 g66097.t1_0035_0035 g13600.t1_0042_0042 0.10455938989199982 0.3122332927029104 0.23600000000000002 0.142 535.0 1024.0 49.1588785046729 51.171875 50.165376752336456 535.0
1 1 g45594.t1_0035_0035 g1464.t1_0042_0042 0.5208761055250978 5.430485421797574 0.7120000000000001 0.489 246.0 222.0 47.967479674796756 44.594594594594604 46.28103713469567 222.0
2 2 g50055.t1_0035_0035 g34744.t1_0042_0035 0.08040473491714645 0.4233916132491867 0.262 0.139 895.0 749.0 56.312849162011176 57.67690253671562 56.994875849363396 749.0
3 3 g34020.t1_0035_0035 g12096.t1_0042_0042 0.4385191689737516 26.834927363887587 0.5760000000000001 0.433 597.0 633.0 37.85594639865997 39.810426540284354 38.83318646947217 597.0
4 4 g28436.t1_0035_0042 g35222.t1_0042_0035 0.055299811368483165 0.1181241496387666 0.1 0.069 450.0 461.0 45.111111111111114 44.90238611713666 45.006748614123886 450.0
5 5 g1005.t1_0035_0035 g11524.t1_0042_0042 0.3528036631463747 19.32549458735676 0.71 0.512 3177.0 3804.0 39.06200818382121 52.944269190325976 46.0031386870736 3177.0
6 6 g28456.t1_0035_0035 g31669.t1_0042_0035 0.4608959702286786 26.823981621115166 0.6859999999999999 0.469 516.0 591.0 49.224806201550386 53.46869712351946 51.346751662534935 516.0
7 7 g6202.t1_0035_0035 g193.t1_0042_0042 0.4679458383555545 17.81312422445775 0.66 0.462 804.0 837.0 41.91542288557214 47.67025089605735 44.79283689081474 804.0
8 8 g60667.t1_0035_0035 g14327.t1_0042_0042 0.046056273155280165 0.13320612138898 0.122 0.067 348.0 408.0 56.89655172413793 55.392156862745104 56.1443542934415 348.0
9 9 g30148.t1_0035_0042 g37790.t1_0042_0035 0.05631607180881047 0.19747150378706246 0.12300000000000001 0.08800000000000001 405.0 320.0 59.012345679012356 58.4375 58.72492283950618 320.0
10 10 g24481.t1_0035_0035 g37405.t1_0042_0035 0.2151957757290965 0.15106487998618026 0.135 0.17600000000000002 270.0 276.0 51.111111111111114 51.44927536231884 51.28019323671497 270.0
11 11 g33270.t1_0035_0035 g21201.t1_0042_0035 0.2773062983971916 21.13839474189674 0.6940000000000001 0.401 297.0 357.0 54.882154882154886 50.42016806722689 52.65116147469089 297.0
12 12 EOG090X03YJ_0035_0035_1 EOG090X03YJ_0042_0042_1 0.5402471721616758 19.278839157918302 0.7070000000000001 0.488 1321.0 1719.0 38.53141559424678 43.92088423502036 41.22614991463357 1321.0
13 13 g13075.t1_0035_0042 g504.t1_0042_0035 0.3317504066721263 4.790120127840871 0.65 0.38799999999999996 372.0 408.0 59.40860215053763 51.470588235294116 55.43959519291587 372.0
14 14 g1026.t1_0035_0035 g7716.t1_0042_0042 0.21445770772761286 13.92799368027682 0.626 0.344 336.0 315.0 38.095238095238095 44.444444444444436 41.26984126984127 315.0
15 15 g18238.t1_0035_0042 g35401.t1_0042_0035 0.3889830456691637 20.33679494952895 0.6759999999999999 0.44799999999999995 320.0 366.0 50.9375 49.453551912568315 50.19552595628416 320.0
df2:
Unnamed: 0 gene scaf_name start end cov_depth GC
179806 g13600.t1_0042_0042 scaffold_6556 1 1149 2.42361684558216 0.528846153846154
315037 g34744.t1_0042_0035 scaffold_8076 17 765 3.49803921568627 0.386138613861386
317296 g35222.t1_0042_0035 scaffold_9018 1 614 93.071661237785 0.41
183513 g14327.t1_0042_0042 scaffold_9358 122 529 3.3184165232357996 0.36
328164 g37790.t1_0042_0035 scaffold_16356 1 320 2.73125 0.436241610738255
326617 g37405.t1_0042_0035 scaffold_14890 1 341 1.3061224489795902 0.36898395721925104
188515 g15510.t1_0042_0042 scaffold_20183 1 276 137.326086956522 0.669354838709677
184561 g14562.t1_0042_0042 scaffold_10427 1 494 157.993927125506 0.46145940390544704
290684 g30982.t1_0042_0035 scaffold_3800 440 940 174.499839537869 0.39823008849557506
179993 g13632.t1_0042_0042 scaffold_6654 29 1114 3.56506849315068 0.46153846153846206
181670 g13942.t1_0042_0042 scaffold_7830 1 811 5.307028360049321 0.529411764705882
196148 g20290.t1_0042_0035 scaffold_1145 2707 9712 78.84112231766741 0.367283950617284
313624 g34464.t1_0042_0035 scaffold_7610 1 480 7.740440324449589 0.549019607843137
303133 g32700.t1_0042_0035 scaffold_5119 1735 2373 118.436578171091 0.49074074074074103
df3:
Unnamed: 0 gene scaf_name start end cov_depth GC
428708 g66097.t1_0035_0035 scaffold_306390 1 695 32.2431654676259 0.389880952380952
342025 g50055.t1_0035_0035 scaffold_188566 15 954 7.062893081761009 0.351129363449692
214193 g28436.t1_0035_0042 scaffold_231066 1 842 25.9774346793349 0.348837209302326
400337 g60667.t1_0035_0035 scaffold_261197 309 656 15.873529411764698 0.353846153846154
224023 g30148.t1_0035_0042 scaffold_263686 10 414 23.2072538860104 0.34108527131782895
184987 g24481.t1_0035_0035 scaffold_65047 817 1593 27.7840552416824 0.533898305084746
249413 g34492.t1_0035_0035 scaffold_106432 1 511 3.2482544608223396 0.368318122555411
249418 g34493.t1_0035_0035 scaffold_106432 547 1230 3.2482544608223396 0.368318122555411
12667 g1120.t1_0035_0042 scaffold_2095 2294 2794 47.864745898359295 0.56203288490284
252797 g35042.t1_0035_0035 scaffold_108853 274 1276 20.269592476489 0.32735426008968604
255878 g36112.t1_0035_0042 scaffold_437464 1 540 74.8252551020408 0.27884615384615397
40058 g4082.t1_0035_0042 scaffold_11195 579 1535 33.4396168320219 0.48487467588591204
271053 g39343.t1_0035_0042 scaffold_590976 1 290 19.6666666666667 0.38636363636363596
89911 g10947.t1_0035_0035 scaffold_21433 1735 2373 32.4222503160556 0.408571428571429

This should do it:
df4 = df1[df1['Seq_1.id'].isin(df2['Seq_1.id']) & df1['Seq_2.id'].isin(df3['Seq_2.id'])]
df4
#   Seq_1.id Seq_2.id
# 0   seq1_A   seq8_B
# 1   seq2_A   Seq9_B
EDIT
You must have swapped the two candidate dataframes; this does not return empty:
df4 = dN_dS[(dN_dS['seq1_id'].isin(candidates_0035['gene'])) & (dN_dS['seq2_id'].isin(candidates_0042['gene']))]
(In your data the seq1_id values are the _0035 genes and the seq2_id values the _0042 genes, so each column must be matched against its own candidates file.)
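If you want to verify which candidates file each ID column matches before filtering, here is a quick sanity check using only the frames already loaded above:
# Count how many IDs of each column appear in each candidates file
print(dN_dS['seq1_id'].isin(candidates_0035['gene']).sum(),
      dN_dS['seq1_id'].isin(candidates_0042['gene']).sum())
print(dN_dS['seq2_id'].isin(candidates_0035['gene']).sum(),
      dN_dS['seq2_id'].isin(candidates_0042['gene']).sum())
A nonzero count on one side and zero on the other tells you which column belongs with which file.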

Related

How to split dataframe into multiple dataframes based on column-name?

I have a dataframe with columns like this:
['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0',
't_energy1', 't_energy2']
And I have code that returns the average of three columns sharing the same name prefix:
# Takes in a dataframe with three columns and returns a dataframe with one column of their means as floats
def average_column(dataframe):
    dataframe = dataframe.copy()  # To avoid SettingWithCopyWarning
    # Create the new column name without the trailing integer
    temp = dataframe.columns.tolist()[0]
    col_name = temp.rstrip(temp[2:-1])
    dataframe[col_name] = dataframe.mean(axis=1)  # axis=1 applies mean() row-wise
    mean_df = dataframe.iloc[:, -1:]  # Isolate the mean column: all rows (:), last column (-1:)
    print("Original:\n{}\nAverage columns:\n{}".format(dataframe, mean_df))
    return mean_df.astype(float)
This function gives me this output:
Original:
t_dance0 t_dance1 t_dance2 dance
0 0.549 0.623 0.5190 0.563667
1 0.871 0.702 0.4160 0.663000
2 0.289 0.328 0.2340 0.283667
3 0.886 0.947 0.8260 0.886333
4 0.724 0.791 0.7840 0.766333
... ... ... ... ...
Average columns:
dance
0 0.563667
1 0.663000
2 0.283667
3 0.886333
4 0.766333
... ...
I asked this question about how to split it into unique and duplicate columns, which led me to this code:
# Splits a dataframe into two separate dataframes: one with all unique
# columns and one with all duplicates
def sub_dataframes(dataframe):
    # Extract common prefixes -> remove trailing digits
    cols = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
    # Split columns
    unq_cols = cols[cols == 1].index
    dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]  # All columns from dataframe that are not in unq_cols
    return dataframe[unq_cols], dataframe[dup_cols]

unq_df, dup_df = sub_dataframes(df)
print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(unq_df, dup_df))
Which gives me this output:
Unique columns:
id
0 22352
1 106534
2 23608
3 8655
4 49670
... ...
Duplicate columns:
t_dur0 t_dur1 t_dur2 t_dance0 t_dance1 t_dance2
0 292720 293760.0 292733.0 0.549 0.623 0.5190
1 213760 181000.0 245973.0 0.871 0.702 0.4160
2 157124 130446.0 152450.0 0.289 0.328 0.2340
3 127896 176351.0 166968.0 0.886 0.947 0.8260
4 210320 226253.0 211880.0 0.724 0.791 0.7840
... ... ... ... ... ... ...
2828 70740 262400.0 220680.0 0.224 0.609 0.7110
2829 252226 222400.0 214973.0 0.526 0.623 0.4820
2830 269146 251560.0 172760.0 0.551 0.756 0.7820
2831 344764 425613.0 249652.0 0.473 0.572 0.8230
2832 210955 339869.0 304124.0 0.112 0.523 0.0679
I have tried to combine these functions in another function that takes in a dataframe and returns the dataframe with all duplicate columns replaced by their mean, but I have trouble splitting dup_df into smaller dataframes. Is there a simpler way I can do this?
An example on the desired output:
Original:
total_tracks t_dur0 t_dur1 t_dur2 t_dance0 t_dance1 t_dance2 \
0 4 292720 293760.0 292733.0 0.549 0.623 0.5190
1 12 213760 181000.0 245973.0 0.871 0.702 0.4160
2 59 157124 130446.0 152450.0 0.289 0.328 0.2340
3 8 127896 176351.0 166968.0 0.886 0.947 0.8260
4 17 210320 226253.0 211880.0 0.724 0.791 0.7840
... ... ... ... ... ... ... ...
After function:
total_tracks popularity duration dance
0 4 21 293071.000000 0.563667
1 12 14 213577.666667 0.663000
2 59 41 146673.333333 0.283667
3 8 1 157071.666667 0.886333
4 17 47 216151.000000 0.766333
... ... ... ...
Use wide_to_long to reshape the original DataFrame first, then aggregate the mean:
cols = ['total_tracks']
df1 = (pd.wide_to_long(df,
                       stubnames=['t_dur', 't_dance'],
                       i=cols,
                       j='tmp')
         .reset_index()
         .drop(columns='tmp')
         .groupby(cols, as_index=False)
         .mean())
print(df1)
total_tracks t_dur t_dance
0 4 293071.000000 0.563667
1 8 157071.666667 0.886333
2 12 213577.666667 0.663000
3 17 216151.000000 0.766333
4 59 146673.333333 0.283667
Details:
cols = ['total_tracks']
print(pd.wide_to_long(df,
                      stubnames=['t_dur', 't_dance'],
                      i=cols,
                      j='tmp'))
t_dur t_dance
total_tracks tmp
4 0 292720.0 0.549
12 0 213760.0 0.871
59 0 157124.0 0.289
8 0 127896.0 0.886
17 0 210320.0 0.724
4 1 293760.0 0.623
12 1 181000.0 0.702
59 1 130446.0 0.328
8 1 176351.0 0.947
17 1 226253.0 0.791
4 2 292733.0 0.519
12 2 245973.0 0.416
59 2 152450.0 0.234
8 2 166968.0 0.826
17 2 211880.0 0.784
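An alternative sketch that averages the duplicate columns in place and leaves the unique ones alone: group the columns by their name prefix (trailing digits stripped). This assumes the all-numeric column layout shown in the question; note the result comes back with columns in sorted group order.
# Strip trailing digits so t_dur0/t_dur1/t_dur2 all share the label 't_dur'
prefixes = df.columns.str.replace(r'\d+$', '', regex=True)
# Group the transposed frame by those labels, average, and transpose back
out = df.T.groupby(prefixes).mean().T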

Trying to look up a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index into a new column in the load data if the Fat of that load is not greater than the next Wet Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a "ValueError: Can only compare identically-labeled Series objects" error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output: a grower_moo['Fat'] of 13.60 is less than a Wet_Fat of 14, and therefore gets the price per ton of $430.
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with column names using underscores _ instead of spaces.
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
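One caution worth a line of code: pd.merge_asof raises a ValueError if either key column is unsorted, so a cheap assertion before the merge can save debugging (using the frames as prepared above):
# Both key columns must be sorted before calling pd.merge_asof
assert grower_moo['Fat'].is_monotonic_increasing
assert pricing['Wet_Fat'].is_monotonic_increasing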
# Alternative: align the two frames side by side, then filter
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']

What is the best way to merge a pandas.DataFrame with a pandas.Series based on df.columns and Series.index names?

I'm facing the following problem and I don't know the cleanest/smartest way to solve it.
I have a dataframe called wfm that contains the input for my simulation
wfm.head()
Out[51]:
OPN Vin Vout_ref Pout_ref CFN ... Cdclink Cdm L k ron
0 6 350 750 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
1 7 400 800 92000 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
2 8 350 900 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
3 9 450 750 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
4 10 450 900 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
[5 rows x 13 columns]
Then on every simulation loop I receive two Series, outputs_rms and outputs_avg, that look like this:
outputs_rms outputs_avg
Out[53]: Out[54]:
time.rms 0.057751 time.avg 5.78E-02
Vi_dc.voltage.rms 400 Vi_dc.voltage.avg 4.00E+02
Vi_dc.current.rms 438.333188 Vi_dc.current.avg 3.81E+02
Vi_dc.power.rms 175333.2753 Vi_dc.power.avg 1.53E+05
Am_in.current.rms 438.333188 Am_in.current.avg 3.81E+02
Cdm.voltage.rms 396.614536 Cdm.voltage.avg 3.96E+02
Cdm.current.rms 0.213185 Cdm.current.avg -5.14E-05
motor_phU.current.rms 566.035833 motor_phU.current.avg -5.67E+02
motor_phU.voltage.rms 296.466083 motor_phU.voltage.avg -9.17E-02
motor_phV.current.rms 0.061024 motor_phV.current.avg 2.58E-02
motor_phV.voltage.rms 1.059341 motor_phV.voltage.avg -1.24E-09
motor_phW.current.rms 566.005071 motor_phW.current.avg 5.67E+02
motor_phW.voltage.rms 297.343876 motor_phW.voltage.avg 9.17E-02
S_ULS.voltage.rms 305.017804 S_ULS.voltage.avg 2.65E+02
S_ULS.current.rms 358.031053 S_ULS.current.avg -1.86E+02
S_UHS.voltage.rms 253.340047 S_UHS.voltage.avg 1.32E+02
S_UHS.current.rms 438.417985 S_UHS.current.avg 3.81E+02
S_VLS.voltage.rms 295.509073 S_VLS.voltage.avg 2.64E+02
S_VLS.current.rms 0 S_VLS.current.avg 0.00E+00
S_VHS.voltage.rms 152.727975 S_VHS.voltage.avg 1.32E+02
S_VHS.current.rms 0.061024 S_VHS.current.avg -2.58E-02
S_WLS.voltage.rms 509.388666 S_WLS.voltage.avg 2.64E+02
S_WLS.current.rms 438.417985 S_WLS.current.avg 3.81E+02
S_WHS.voltage.rms 619.258959 S_WHS.voltage.avg 5.37E+02
S_WHS.current.rms 357.982417 S_WHS.current.avg -1.86E+02
Cdclink.voltage.rms 801.958092 Cdclink.voltage.avg 8.02E+02
Cdclink.current.rms 103.73088 Cdclink.current.avg 2.08E-05
Am_out.current.rms 317.863371 Am_out.current.avg 1.86E+02
Vo_dc.voltage.rms 800 Vo_dc.voltage.avg 8.00E+02
Vo_dc.current.rms 317.863371 Vo_dc.current.avg -1.86E+02
Vo_dc.power.rms 254290.6969 Vo_dc.power.avg -1.49E+05
CFN 1 CFN 1.00E+00
OPN 6 OPN 6.00E+00
dtype: float64 dtype: float64
My goal is to place outputs_rms and outputs_avg on the right row of wfm, based on the 'CFN' and 'OPN' values.
What are your suggestions?
Thanks,
Riccardo
Suppose that you create these series as outputs output_rms_1, output_rms_2, etc.; then the series can be combined into one dataframe:
import pandas as pd
dfRms = pd.DataFrame([output_rms_1, output_rms_2, output_rms_3])
The next output, say output_rms_10, can simply be added by using:
dfRms = pd.concat([dfRms, output_rms_10.to_frame().T], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
Finally, when all outputs are joined into one Dataframe,
you can merge the original wfm with the output, i.e.
result = pd.merge(wfm, dfRms, on=['CFN', 'OPN'], how='left')
Similarly for avg.
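Putting this together, here is a minimal sketch of the whole collection loop; simulation_cases and run_simulation are hypothetical stand-ins for however the Series are actually produced each loop:
import pandas as pd

rms_rows = []
for case in simulation_cases:              # hypothetical iterable of simulation runs
    outputs_rms = run_simulation(case)     # hypothetical; returns a Series like the one shown above
    rms_rows.append(outputs_rms)

dfRms = pd.DataFrame(rms_rows)             # one row per simulation, columns from the Series index
result = pd.merge(wfm, dfRms, on=['CFN', 'OPN'], how='left')
The same pattern applies to the avg outputs.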

Calculate mean values from pandas dataframe

I am trying to find a good way to calculate mean values from values in a dataframe. It contains measured data from an experiment and is imported from an excel sheet. The columns contain the time passed by, electric current and the corresponding voltage.
The current is changed in steps and then held for some time (the current values vary a little, so they are not exactly the same for each step). Now I want to calculate the mean voltage for each current step. Since it takes some time for the voltage to stabilize after a step, I also want to leave out the first few voltage values after each step.
Currently I am doing this with loops, but I was wondering whether there is a nicer way using the groupby function (or others, maybe).
Just say if you need more details or clarification.
Example of data:
s [A] [V]
0 6.0 -0.001420 0.780122
1 12.0 -0.002484 0.783297
2 18.0 -0.001478 0.785870
3 24.0 -0.001256 0.793559
4 30.0 -0.001167 0.806086
5 36.0 -0.000982 0.815364
6 42.0 -0.003038 0.825018
7 48.0 -0.001174 0.831739
8 54.0 0.000478 0.838861
9 60.0 -0.001330 0.846086
10 66.0 -0.001456 0.851556
11 72.0 0.000764 0.855950
12 78.0 -0.000916 0.859778
13 84.0 -0.000916 0.859778
14 90.0 -0.001445 0.863569
15 96.0 -0.000287 0.864303
16 102.0 0.000056 0.865080
17 108.0 -0.001119 0.865642
18 114.0 -0.000843 0.866434
19 120.0 -0.000997 0.866809
20 126.0 -0.001243 0.866964
21 132.0 -0.002238 0.867180
22 138.0 -0.001015 0.867177
23 144.0 -0.000604 0.867505
24 150.0 0.000507 0.867571
25 156.0 -0.001569 0.867525
26 162.0 -0.001569 0.867525
27 168.0 -0.001131 0.866756
28 174.0 -0.001567 0.866884
29 180.0 -0.002645 0.867240
.. ... ... ...
242 1708.0 24.703866 0.288902
243 1714.0 26.469208 0.219226
244 1720.0 26.468838 0.250437
245 1726.0 26.468681 0.254972
246 1732.0 26.468173 0.271525
247 1738.0 26.468260 0.247282
248 1744.0 26.467666 0.296894
249 1750.0 26.468085 0.247300
250 1756.0 26.468085 0.247300
251 1762.0 26.467808 0.261096
252 1768.0 26.467958 0.259615
253 1774.0 26.467828 0.260871
254 1780.0 28.232325 0.185291
255 1786.0 28.231697 0.197642
256 1792.0 28.231170 0.172802
257 1798.0 28.231103 0.170685
258 1804.0 28.229453 0.184009
259 1810.0 28.230816 0.181833
260 1816.0 28.230913 0.188348
261 1822.0 28.230609 0.178440
262 1828.0 28.231144 0.168507
263 1834.0 28.231144 0.168507
264 1840.0 8.813723 0.641954
265 1846.0 8.814301 0.652373
266 1852.0 8.818517 0.651234
267 1858.0 8.820255 0.637536
268 1864.0 8.821443 0.628136
269 1870.0 8.823643 0.636616
270 1876.0 8.823297 0.635422
271 1882.0 8.823575 0.622253
Output:
s [A] [V]
0 303.000000 -0.000982 0.857416
1 636.000000 0.879220 0.792504
2 699.000000 1.759356 0.752446
3 759.000000 3.519479 0.707161
4 816.000000 5.278372 0.669020
5 876.000000 7.064800 0.637848
6 939.000000 8.828799 0.611196
7 999.000000 10.593054 0.584402
8 1115.333333 12.357359 0.556127
9 1352.000000 14.117167 0.528826
10 1382.000000 15.882287 0.498577
11 1439.000000 17.646748 0.468379
12 1502.000000 19.410817 0.437342
13 1562.666667 21.175572 0.402381
14 1621.000000 22.939826 0.365724
15 1681.000000 24.704600 0.317134
16 1744.000000 26.468235 0.256047
17 1807.000000 28.231037 0.179606
18 1861.000000 8.819844 0.638190
The current approach:
import math
import pandas as pd

df = df[['s', '[A]', '[V]']]

# Loop over the rows to separate the current steps
b = df['[A]'].iloc[0]
start = 0
steps = []
for index, row in df.iterrows():
    if not math.isclose(row['[A]'], b, abs_tol=1e-02):
        b = row['[A]']
        steps.append(df.iloc[start:index])
        start = index
steps.append(df.iloc[start:])

# Delete the first few points after each current change
steps_trimmed = [s.iloc[3:] for s in steps]

# Calculate the mean values for each current step
result = pd.DataFrame([s.mean() for s in steps_trimmed])
Does this help?
df.groupby(['Columnname', 'Columnname2']).mean()
You may need to create intermediate dataframes for each step. Can you provide an example of the output you want?
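If you want to avoid the explicit loop, here is a sketch built on groupby, using the same tolerance and trim settings as the code above but with a row-to-row jump test instead of comparing to the step's first value (column names as in the question):
# Start a new group id whenever the current jumps by more than the tolerance
step = df['[A]'].diff().abs().gt(1e-02).cumsum()
# Drop the first 3 rows of each step, then average what remains
result = (df.groupby(step)
            .apply(lambda g: g.iloc[3:].mean())
            .reset_index(drop=True))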

Python (pandas): How to read data by chunks in a long file where chunks are separated by a header and are not equal in length?

I have the following data in a single txt file:
00001+++00001 000031 12.8600 -1 7 BEAR
1990052418 276.00 18.80 0.00 12.86
1990052500 276.00 19.70 0.00 12.86
00002+++00002 000047 30.8700 995 22 LION
1990072206 318.10 8.80 1010.00 12.86
1990072212 316.80 8.90 1010.00 12.86
1990072218 315.40 9.30 1010.00 12.86
1990072300 313.60 9.30 1010.00 12.86
00003+++00003 000050 36.0100 973 37 BIRD
1990072412 285.00 34.00 1012.00 10.29
1990072418 286.20 33.60 1013.00 10.29
1990072500 287.30 33.00 1013.00 10.29
How can I read each (unequal) chunk under each header separately in Python (Pandas)?
Here's a solution which parses each block (without reading in the entire file):
import pandas as pd
from io import StringIO

def read_fwfs(text_file):
    res = {}
    with open(text_file) as f:
        header = f.readline()
        block = []
        for line in f:
            if "+++" in line:
                # End of block, so save it to the dict as a DataFrame.
                # Note: this next line could be pulled out as a function.
                res[header.split("+++")[0]] = pd.read_fwf(StringIO("".join(block)), header=None)
                # Reset variables.
                header = line
                block = []
            # Ignore blank lines.
            elif line != "\n":
                block.append(line)
        # Save the last block.
        # Note: See what I mean about being a function? Here it is again:
        res[header.split("+++")[0]] = pd.read_fwf(StringIO("".join(block)), header=None)
    return res
In [11]: d = read_fwfs("my_text_file.txt")
d
Out[11]:
{'00001': 0 1 2 3 4
0 1990052418 276 18.8 0 12.86
1 1990052500 276 19.7 0 12.86,
'00002': 0 1 2 3 4
0 1990072206 318.1 8.8 1010 12.86
1 1990072212 316.8 8.9 1010 12.86
2 1990072218 315.4 9.3 1010 12.86
3 1990072300 313.6 9.3 1010 12.86,
'00003': 0 1 2 3 4
0 1990072412 285.0 34.0 1012 10.29
1 1990072418 286.2 33.6 1013 10.29
2 1990072500 287.3 33.0 1013 10.29}
In [12]: d["00003"]
Out[12]:
0 1 2 3 4
0 1990072412 285.0 34.0 1012 10.29
1 1990072418 286.2 33.6 1013 10.29
2 1990072500 287.3 33.0 1013 10.29
If you want to do something more with the splitting line, e.g. using it as a header, you could add that into the read_fwf part, e.g. by adding the second half of the split header to the joined block. As I mention inline, it's probably a good idea to pull that part out as a function anyway.
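For instance, here is a sketch of that extracted helper (the name save_block is mine, not from the original code):
def save_block(res, header, block):
    # Key the DataFrame by the part of the header before '+++'
    res[header.split("+++")[0]] = pd.read_fwf(StringIO("".join(block)), header=None)
The two duplicated lines in read_fwfs then become save_block(res, header, block).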
