I have a tab-delimited csv file containing a dataset with two float columns (time and value). I have hundreds of these files from a piece of lab equipment. An example set is shown below.
3.64 1.22e-11
4.14 2.44e-11
4.64 1.22e-11
5.13 2.44e-11
5.66 1.22e-11
6.17 1.22e-11
6.67 2.44e-11
7.17 2.44e-11
7.69 1.22e-11
8.20 2.44e-11
8.70 1.22e-11
9.20 2.44e-11
9.72 2.44e-11
10.22 1.22e-11
10.72 1.22e-11
11.22 1.22e-11
11.72 1.22e-11
12.22 1.22e-11
12.70 -1.95e-10
13.22 -1.57e-09
13.73 -3.04e-09
14.25 -4.39e-09
14.77 -5.73e-09
15.28 -7.02e-09
15.80 -8.26e-09
16.28 -8.61e-09
16.83 -8.70e-09
17.31 -8.76e-09
17.81 -8.80e-09
18.31 -8.83e-09
18.83 -8.91e-09
19.33 -8.98e-09
19.84 -9.02e-09
20.34 -9.05e-09
20.84 -9.06e-09
21.34 -9.07e-09
21.88 -9.08e-09
22.39 -9.08e-09
22.89 -9.09e-09
23.39 -9.09e-09
23.89 -9.09e-09
24.41 -9.09e-09
I want to trim the data so that time (x, 1st column) resets to 0 when the value (y, 2nd column) starts to change, and also trim after the value plateaus.
For the 1st derivative, if I use numpy.gradient I can see where the data changes, but I couldn't find a similar function in pandas.
Any suggestions?
Added: the output (done manually in Excel) will look like the below, where (in this case) the first 18 rows and the last 3 are removed, and both columns are reset to start at 0 by subtracting the first retained row's values from every row.
0.00 0.000000000000
0.52 -0.000000001375
1.03 -0.000000002845
1.55 -0.000000004195
2.07 -0.000000005535
2.58 -0.000000006825
3.10 -0.000000008065
3.58 -0.000000008415
4.13 -0.000000008505
4.61 -0.000000008565
5.11 -0.000000008605
5.61 -0.000000008635
6.13 -0.000000008715
6.63 -0.000000008785
7.14 -0.000000008825
7.64 -0.000000008855
8.14 -0.000000008865
8.64 -0.000000008875
9.18 -0.000000008885
9.69 -0.000000008885
10.19 -0.000000008895
What I have tried is using Python and pandas to differentiate and then remove rows where the derivative is 0, but that also removes data points within the output I want.
dfT = df1[df1.dB != 0]
dfT = dfT[df1.dB >= 0]
dfT = dfT.dropna()
dfT = dfT.reset_index(drop=True)
dfT
Why not use what is already working, i.e. np.gradient, and put the result back into your dataframe? I am not able to create your final desired output, however, since it looks like you rely on more than just filtering out gradient == 0. Open to fixing it once the logic is a bit clearer to me.
### imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
### data
time = [3.64, 4.14, 4.64, 5.13, 5.66, 6.17, 6.67, 7.17, 7.69, 8.20, 8.70, 9.20, 9.72, 10.22, 10.72, 11.22, 11.72, 12.22, 12.70, 13.22, 13.73, 14.25, 14.77, 15.28, 15.80, 16.28, 16.83, 17.31, 17.81, 18.31, 18.83, 19.33, 19.84, 20.34, 20.84, 21.34, 21.88, 22.39, 22.89, 23.39, 23.89, 24.41]
value = [1.22e-11, 2.44e-11, 1.22e-11, 2.44e-11, 1.22e-11, 1.22e-11, 2.44e-11, 2.44e-11, 1.22e-11, 2.44e-11, 1.22e-11, 2.44e-11, 2.44e-11, 1.22e-11, 1.22e-11, 1.22e-11, 1.22e-11, 1.22e-11, -1.95e-10, -1.57e-09, -3.04e-09, -4.39e-09, -5.73e-09, -7.02e-09, -8.26e-09, -8.61e-09, -8.70e-09, -8.76e-09, -8.80e-09, -8.83e-09, -8.91e-09, -8.98e-09, -9.02e-09, -9.05e-09, -9.06e-09, -9.07e-09, -9.08e-09, -9.08e-09, -9.09e-09, -9.09e-09, -9.09e-09, -9.09e-09]
### dataframe creation
# df = pd.read_csv('test.csv', sep='\t', names=["time", "value"])
df = pd.DataFrame({'time':time, 'value':value})
plt.plot(df.time,df.value)
Outputs: (a plot of value against time)
Next you can differentiate, and as you can see, even within the first 18 rows you mentioned there are multiple points where the gradient is non-zero:
df['gradient'] = np.gradient(df.value.values)
df
plt.plot(df.time,df.gradient)
Outputs: (a plot of the gradient against time)
Next, filter out the rows where nothing changes and add a new time column:
### filter data where gradient is not 0 and add new time
df_filtered = df[df.gradient != 0].copy()  # .copy() avoids SettingWithCopyWarning
df_filtered['time_difference'] = df_filtered.time.diff().fillna(0)
df_filtered['new_time'] = df_filtered['time_difference'].cumsum()
df_filtered.reset_index(drop=True,inplace=True)
df_filtered
Outputs: (the filtered dataframe with the new_time column)
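To go further and also drop the flat head and tail entirely (the part of your desired output I could not reproduce exactly), here is a minimal sketch; the 1e-10 threshold is my assumption and needs tuning to your noise level:
### trim before the value starts changing and after it plateaus (sketch)
threshold = 1e-10  # assumed noise floor -- tune for your data

changing = df.gradient.abs() > threshold
first = changing.idxmax()        # first row where the value really moves
last = changing[::-1].idxmax()   # last row where it is still moving

trimmed = df.loc[first:last].copy()
trimmed['new_time'] = trimmed.time - trimmed.time.iloc[0]      # reset time to 0
trimmed['new_value'] = trimmed.value - trimmed.value.iloc[0]   # reset value likewise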
Related
How can I combine multiple txt files into one merged file, where each file contains a different number of columns (usually with float values)? I need to get one merged file with all the columns, as follows.
EDIT:
There is one rule: in case there is a non-numeric value ("Nan", for example), I need to pad with the last numeric value that came before it.
file1.txt
1.04
2.26
3.87
file2.txt
5.44 4.65 9.86
8.67 Nan 7.45
8.41 6.54 6.21
file3.txt
6.98 6.52
4.45 8.74
0.58 4.12
merged.txt
1.04 5.44 4.65 9.86 6.98 6.52
2.26 8.67 8.67 7.45 4.45 8.74
3.87 8.41 6.54 6.21 0.58 4.12
I saw an answer here for the case of one column in each file.
How can I do this for multiple columns?
The simplest way is probably using numpy:
import numpy as np

filenames = ["file1.txt", "file2.txt", "file3.txt"]
fmt = '%.2f'  # assuming format is known in advance

# Read every file and stack all columns side by side
all_columns = []
for filename in filenames:
    all_columns.append(np.genfromtxt(filename))
arr_out = np.column_stack(tuple(all_columns))

# Fill NaN elements with the last numeric value (in row-major order)
arr_1d = np.ravel(arr_out)  # flat view onto arr_out
nan_indices = np.where(np.isnan(arr_1d))
while len(nan_indices[0]):
    new_indices = tuple(i - 1 for i in nan_indices)
    arr_1d[nan_indices] = arr_1d[new_indices]
    nan_indices = np.where(np.isnan(arr_1d))

np.savetxt("merged.txt", arr_out, fmt=fmt)
One problem (if it is one for you) that might occur is that the very first element, i.e. the upper-left one, is non-numeric. In that case the index shift wraps around to -1, so the last (lower-right) value, or the last numeric value before that, would be used instead.
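If you would rather do the padding in pandas, here is a sketch of the same idea: flattening in row-major order and forward-filling reproduces the "last numeric value before it" rule. The na_values=['Nan'] argument is passed explicitly in case pandas does not treat that spelling as missing by default:
import pandas as pd

filenames = ["file1.txt", "file2.txt", "file3.txt"]

# Read each whitespace-delimited file, mapping the literal token "Nan" to NaN
frames = [pd.read_csv(f, sep=r'\s+', header=None, na_values=['Nan'])
          for f in filenames]
merged = pd.concat(frames, axis=1, ignore_index=True)

# Flatten row-major, forward-fill, and reshape back: each NaN becomes
# the last numeric value that precedes it in reading order
flat = pd.Series(merged.to_numpy().ravel()).ffill()
filled = pd.DataFrame(flat.to_numpy().reshape(merged.shape))

filled.to_csv("merged.txt", sep=' ', header=False, index=False,
              float_format='%.2f')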
I have a dataframe that contains data collected every 0.01 m down into the earth. Due to its high resolution, the resulting dataset is very large. Is there a way in pandas to downsample to 5 m intervals, thus reducing the size of the dataset?
RESULT (every 0.01 m):
Depth_m  value
1.34     31.7
1.35     31.7
1.36     31.7
1.37     31.9
1.38     31.9
1.39     31.9
1.40     31.9
...      ...
44.35    32.9
44.36    32.9
44.37    32.9
OUTCOME I WANT (every 5 m):
Depth_m  value
5.47     31.7
10.49    31.7
15.51    31.7
20.53    31.9
25.55    31.9
30.57    31.9
35.59    31.9
40.61    31.9
45.63    31.9
I have tried to use pandas.resample, but that seems to work only with time-series data. I think I understand what I must do, but I am not sure how to do it in pandas. Basically, I need to work out the current sampling rate, in this case 0.01 m, then how many observations fall within every 5 m. Then I can average the values over each window of that many observations and drop the remaining rows, looping through this process every 5 m.
You can use pandas' .iloc for selection by position, coupled with a slice object, to downsample. Some care must be taken to ensure you have integer step sizes and not floats when converting from non-integer sample intervals (hence the use of astype("int")).
import numpy as np
import pandas as pd
sequence_interval = 0.01
downsampled_interval = 5
step_size = np.round(downsampled_interval / sequence_interval).astype("int")
df = pd.DataFrame(
{
"Depth_m": np.arange(131, 4438) / 100,
"value": np.random.random(size=4307),
}
)
downsampled_df = df.iloc[::step_size, :]
print(downsampled_df)
The result is
Depth_m value
0 1.31 0.357536
500 6.31 0.384327
1000 11.31 0.302109
1500 16.31 0.200971
2000 21.31 0.689973
2500 26.31 0.712869
3000 31.31 0.776306
3500 36.31 0.221901
4000 41.31 0.661378
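If you want the average of each window rather than one sample from it (closer to what you described), a small sketch reusing step_size from above:
# Average every step_size consecutive rows instead of picking one per window
# (note this averages Depth_m as well, giving the midpoint of each window)
df.groupby(np.arange(len(df)) // step_size).mean()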
There is no resample for plain numeric values. As a workaround, you could round the depth to the nearest multiple of 5 and use groupby to get the average Value for every 5 m of depth:
>>> df.groupby(df["Depth_m"].apply(lambda x: 5*round(x/5)))["Value"].mean()
Depth_m
0 34.256410
5 34.274549
10 34.564870
15 34.653307
20 34.630739
25 34.517034
30 34.584830
35 34.581162
40 34.620758
45 34.390374
Name: Value, dtype: float64
Input df:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame({"Depth_m": [i/100 for i in range(134, 4438)],
                   "Value": np.random.randint(30, 40, size=4304)})
I want to read a csv file into Jupyter Notebook with Python's pandas library.
I have uploaded the .csv file into my Jupyter notebook and written some code, but I think my dataframe does not display correctly.
This is the code for reading the file:
df = pd.read_csv('text analysis.csv')
print(df)
And my output, when I print that dataframe looks like this:
avg title word len. avg text word len. avg sent. len. \
0 5.20 4.27 11.00
1 4.69 4.98 26.20
2 5.50 4.53 21.62
3 4.82 4.42 15.10
4 6.40 5.07 36.50
... ... ... ...
34205 4.29 4.96 24.60
34206 4.67 4.58 13.00
34207 4.92 5.08 26.79
34208 4.09 4.72 22.23
34209 4.75 5.76 18.38
I have seen much better representations in Jupyter Notebook, with all cells shown. This looks worse than when I print the dataframe in IDLE.
Try using display() instead of print() and check the result.
Add this to your code:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
This sets pandas to display the entire dataframe.
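If you don't want to change the options globally, pandas also has a context manager that restores the settings on exit; for example:
import pandas as pd

# Options apply only inside the with-block; display() is the Jupyter helper
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df)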
Suppose I have a pandas dataframe with 16 columns and approximately 1000 rows,
in a format like this:
date_time sec01 sec02 sec03 sec04 sec05 sec06 sec07 sec08 sec09 sec10 sec11 sec12 sec13 sec14 sec15 sec16
1970-01-01 05:54:17 8.50 8.62 8.53 8.45 8.50 8.62 8.53 8.45 8.42 8.39 8.39 8.40 8.47 8.54 8.65 8.70
1970-01-01 05:56:55 8.43 8.62 8.55 8.45 8.43 8.62 8.55 8.45 8.42 8.39 8.39 8.40 8.46 8.53 8.65 8.71
and now I need to make another pandas dataframe with 32 columns:
x_sec01 y_sec01 x_sec02 y_sec02 x_sec03 y_sec03 x_sec04 y_sec04 x_sec05 y_sec05 x_sec06 y_sec06 x_sec07 ...
where the values of each column need to be multiplied by a specific mathematical constant that depends on the column number (sector number):
x = sec_data * (math.cos(math.radians(1.40625*(sector_number))))
y = sec_data * (math.sin(math.radians(1.40625*(sector_number))))
Thus each column in the original pandas dataframe (sec01-sec16) needs to be converted into two columns (x_sec01, y_sec01), and the factor by which it has to be multiplied depends on the sector_number value.
Currently I am using this function, calling it for every single row in a for loop, and it is taking too much time.
def sec_to_xy(sec_no, sec_data):  # function to convert sector data to xy coordinate system
    for sec_convno in range(0, 32, 2):
        sector_number = (77 - (sec_no - 1) * 2)  # goes from 79 till 49
        x = sec_data * (math.cos(math.radians(1.40625 * (sector_number))))
        y = sec_data * (math.sin(math.radians(1.40625 * (sector_number))))
    return (x, y)
The general idea is to stack your values so you can apply numpy's fast, vectorized functions.
# stack the dataframe
df2 = df.stack().reset_index(level=1)
df2.columns = ['sec', 'value']
# extract the sector number
df2['sec_no'] = df2['sec'].str.slice(-2).astype(int)
# apply numpy's vectorized functions
import numpy as np
df2['x'] = df2['value'] * (np.cos(np.radians(1.40625*(df2['sec_no']))))
df2['y'] = df2['value'] * (np.sin(np.radians(1.40625*(df2['sec_no']))))
At this stage, here is what df2 looks like:
sec value sec_no x y
1970-01-01 05:54:17 sec01 8.50 1 8.497440 0.208600
1970-01-01 05:54:17 sec02 8.62 2 8.609617 0.422963
1970-01-01 05:54:17 sec03 8.53 3 8.506888 0.627506
1970-01-01 05:54:17 sec04 8.45 4 8.409311 0.828245
1970-01-01 05:54:17 sec05 8.50 5 8.436076 1.040491
Now pivot the table to return to the original shape:
df2[['sec', 'x', 'y']].pivot(columns='sec')
All that is left to do is rename the columns.
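For instance, one way to flatten the resulting MultiIndex columns into the x_sec01/y_sec01 style names (a sketch, using the same df2 as above):
df3 = df2[['sec', 'x', 'y']].pivot(columns='sec')
# Flatten the (x/y, secNN) MultiIndex into names like 'x_sec01'
df3.columns = [f'{xy}_{sec}' for xy, sec in df3.columns]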
Here's an approach with NumPy -
# Extract as float array
a = df.values # Extract all 16 columns
m,n = a.shape
# Scaling array
s = np.radians(1.40625*(np.arange(79,47,-2)))
# Initialize output array and set cosine and sine values
out = np.zeros((m,n,2))
out[:,:,0] = a*np.cos(s)
out[:,:,1] = a*np.sin(s)
# Transfer to a dataframe output
df_out = pd.DataFrame(out.reshape(-1,n*2),index=df.index)
Please note that if there are actually 17 columns, with the first column being date_time, then we need to skip the first column. So, at the start, get a with the following step instead -
a = df.iloc[:, 1:].values
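The reshape interleaves the cosine and sine planes, so the output columns alternate x and y per sector; assuming the last 16 columns of df are sec01..sec16, they could be named like this:
# Columns come out interleaved: x_sec01, y_sec01, x_sec02, y_sec02, ...
sec_cols = df.columns[-16:]  # assumes the last 16 columns are sec01..sec16
df_out.columns = [f'{xy}_{col}' for col in sec_cols for xy in ('x', 'y')]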
I have the following csv file:
RUN YR AP15 PMTE
12008 4.53 0.04
12009 3.17 0.26
12010 6.20 1.38
12011 5.38 3.55
12012 7.32 6.13
12013 4.39 9.40
Here, the column 'YR' has the values 2008, 2009, ..., 2013. However, there is no space between the values for YR and the values for RUN. Because of this, when I try to read in the dataframe, it does not read the YR column correctly.
pandas.read_csv('file.csv', skipinitialspace=True, usecols=['YR','PMTE'], sep=' ')
The line above reads in the AP15 column instead of YR. How do I fix this?
It seems like your 'csv' is really a fixed-width format file. Sometimes these are accompanied by another file listing the size of each column, but maybe you aren't that lucky and have to count the column widths manually. You can read this file with pandas' fixed-width reading function:
df = pd.read_fwf('fixed_width.txt', widths=[4, 4, 8, 8])
In [7]: df
Out[7]:
RUN YR AP15 PMTE
0 1 2008 4.53 0.04
1 1 2009 3.17 0.26
2 1 2010 6.20 1.38
3 1 2011 5.38 3.55
4 1 2012 7.32 6.13
5 1 2013 4.39 9.40
In [8]: df.columns
Out[8]: Index(['RUN', 'YR', 'AP15', 'PMTE'], dtype='object')
There is an option to find the widths automatically but it probably requires at least a space between each column, as it doesn't seem to work here.
One workaround would be to first merge the RUN and YR columns into one column in your csv. Example -
RUNYR AP15 PMTE
12008 4.53 0.04
12009 3.17 0.26
12010 6.20 1.38
12011 5.38 3.55
12012 7.32 6.13
12013 4.39 9.40
Then read the csv into a dataframe with RUNYR as a string column, and then slice the RUNYR column into two separate columns using the pandas.Series.str.slice method. Example -
df = pd.read_csv('file.csv', skipinitialspace=True, header=0, sep=' ',dtype={'RUNYR':str})
df['RUN'] = df['RUNYR'].str.slice(None,1).astype(int)
df['YR'] = df['RUNYR'].str.slice(1).astype(int)
df = df.drop('RUNYR',axis=1)
Demo -
In [21]: df = pd.read_csv('a.csv', skipinitialspace=True, header=0, sep=' ',dtype={'RUNYR':str})
In [22]: df['RUN'] = df['RUNYR'].str.slice(None,1).astype(int)
In [23]: df['YR'] = df['RUNYR'].str.slice(1).astype(int)
In [24]: df = df.drop('RUNYR',axis=1)
In [25]: df
Out[25]:
AP15 PMTE RUN YR
0 4.53 0.04 1 2008
1 3.17 0.26 1 2009
2 6.20 1.38 1 2010
3 5.38 3.55 1 2011
4 7.32 6.13 1 2012
5 4.39 9.40 1 2013
And then write this back to your csv using the .to_csv method, to fix your csv permanently.
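That last write-back might look like this (a sketch; the space separator matches the original file's layout):
# Overwrite the csv with the repaired columns, space-separated, no index
df.to_csv('file.csv', sep=' ', index=False)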