How to re-sample and interpolate new data in python

I have a csv file containing the following information:
Time(s) Variable
0.003 1
0.009 2
0.056 3
0.094 4
0.4 5
0.98 6
1.08 7
1.45 8
1.89 9
2.45 10
2.73 11
3.2 12
3.29 13
3.5 14
I would like to change the time column into 0.25 s intervals starting from 0, and have the associated variable data change along with it (i.e. if at 2.45 v=10, then at 2.5 v=10.2). The variable data would have to be interpolated against the change in the time data, I assume. I need to be able to do this straight from the csv rather than writing out the data in Python, as the real data set is thousands of rows.
Not sure if what I want is exactly possible, but some thoughts would go a long way, thanks!

How about SciPy's interp1d?
from scipy.interpolate import interp1d
import numpy as np
import pandas as pd

interp = interp1d(df['Time(s)'], df['Variable'])
new_times = np.arange(0.25, 3.5, 0.25)
pd.DataFrame({'Time(s)': new_times, 'Variable': interp(new_times)})
Output:
Time(s) Variable
0 0.25 4.509804
1 0.50 5.172414
2 0.75 5.603448
3 1.00 6.200000
4 1.25 7.459459
5 1.50 8.113636
6 1.75 8.681818
7 2.00 9.196429
8 2.25 9.642857
9 2.50 10.178571
10 2.75 11.042553
11 3.00 11.574468
12 3.25 12.555556
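To tie this back to the question (reading straight from the csv and starting the grid at 0, which lies just before the first measured time of 0.003 s), here is a minimal sketch extending the answer above. The file name data.csv is a placeholder, and fill_value='extrapolate' is passed so that t=0 can be evaluated:
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# placeholder path; adjust the separator (e.g. sep=r'\s+') to match your file
df = pd.read_csv('data.csv')

# fill_value='extrapolate' lets us evaluate at t=0, which lies before
# the first measured time (0.003 s)
interp = interp1d(df['Time(s)'], df['Variable'], fill_value='extrapolate')

new_times = np.arange(0, 3.75, 0.25)  # 0.00, 0.25, ..., 3.50 (arange excludes the stop)
resampled = pd.DataFrame({'Time(s)': new_times, 'Variable': interp(new_times)})
print(resampled)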

Related

How to compute mean absolute deviation row-wise in pandas

A snippet of the dataframe is as follows, but the actual dataset is 200000 x 130.
ID 1-jan 2-jan 3-jan 4-jan
1. 4 5 7 8
2. 2 0 1 9
3. 5 8 0 1
4. 3 4 0 0
I am trying to compute Mean Absolute Deviation for each row value like this.
ID 1-jan 2-jan 3-jan 4-jan mean
1. 4 5 7 8 12.5
1_MAD 8.5 7.5 5.5 4.5
2. 2 0 1 9 6
2_MAD 4 6 5 3
.
.
I tried this,
new_df = pd.DataFrame()
for row_value in df['ID']:
    new_df[str(row_value) + '_mad'] = mad(df.loc[row_value][1:])
new_df.T
where mad is a function that compares the mean to each value.
But this is very time consuming since I have a large dataset, and I need to do it in the quickest way possible.
pd.concat([df1.assign(mean1=df1.mean(axis=1)).set_index(df1.index.astype('str'))
,df1.assign(mean1=df1.mean(axis=1)).apply(lambda ss:ss.mean1-ss,axis=1)
.T.add_suffix('_MAD').T.assign(mean1='')]).sort_index().pipe(print)
1-jan 2-jan 3-jan 4-jan mean1
ID
1.0 4.00 5.00 7.00 8.00 6.0
1.0_MAD 2.00 1.00 -1.00 -2.00
2.0 2.00 0.00 1.00 9.00 3.0
2.0_MAD 1.00 3.00 2.00 -6.00
3.0 5.00 8.00 0.00 1.00 3.5
3.0_MAD -1.50 -4.50 3.50 2.50
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD -1.25 -2.25 1.75 1.75
IIUC use:
#convert ID to index
df = df.set_index('ID')
#mean to Series
mean = df.mean(axis=1)
from toolz import interleave
#subtract all columns by mean, add suffix
df1 = df.sub(mean, axis=0).abs().rename(index=lambda x: f'{x}_MAD')
#join with original with mean and interleave indices
df = pd.concat([df.assign(mean=mean), df1]).loc[list(interleave([df.index, df1.index]))]
print (df)
1-jan 2-jan 3-jan 4-jan mean
ID
1.0 4.00 5.00 7.00 8.00 6.00
1.0_MAD 2.00 1.00 1.00 2.00 NaN
2.0 2.00 0.00 1.00 9.00 3.00
2.0_MAD 1.00 3.00 2.00 6.00 NaN
3.0 5.00 8.00 0.00 1.00 3.50
3.0_MAD 1.50 4.50 3.50 2.50 NaN
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD 1.25 2.25 1.75 1.75 NaN
It's possible to specify axis=1 to apply the mean calculation across columns:
df['mean_across_cols'] = df.mean(axis=1)
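If speed is the main concern, the same row-wise absolute deviations can also be computed in a single vectorized pass with NumPy broadcasting, which avoids any per-row Python loop. This is only a sketch, assuming df still has the ID column and every remaining column is numeric:
import numpy as np
import pandas as pd

# numeric part only, as a plain (n_rows, n_cols) array
values = df.drop(columns='ID').to_numpy(dtype=float)

row_means = values.mean(axis=1, keepdims=True)  # shape (n_rows, 1)
abs_dev = np.abs(values - row_means)            # broadcast subtraction

# wrap back into a DataFrame, labelling rows like the desired output
mad_df = pd.DataFrame(abs_dev,
                      columns=df.columns.drop('ID'),
                      index=df['ID'].astype(str) + '_MAD')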

Incorrect exported data - How to split values in specific column and shift it to other columns in pandas dataframe?

I exported some data and this export did not work completely fine.
I read my data into a pandas dataframe and it now looks like this:
time A B C D E F
1 NaN nullnull0.54 0.74 0.89 NaN NaN
2 NaN nullnull0.01 3.32 1.19 NaN NaN
3 NaN nullnull1.89 0.65 4.50 NaN NaN
4 NaN nullnull4.64 2.87 2.22 NaN NaN
5 0.52 1.43 3.56 5.65 0.06 1.11
6 3.51 0.89 0.96 1.10 2.08 4.29
7 0.11 10.20 3.36 2.15 0.70 1.99
time is my first column and then I have six columns A to F.
Column A is correct; I did not get any data there. The problem begins in column B. For B and C I could not actually extract values other than these null values. But instead of writing those null values into columns B and C of my csv-file, the export writes nullnull0.54 into column B, fills columns C and D with the other extracted values, and adds NaN values for E and F. I.e. the values in C should be in E and the values in D should be in F for all rows where this nullnull pattern is observed (and the numeric value attached to B should go into D). That means I need to write code which splits the value in B into three parts (null, null and the numeric value) and then shifts the numeric values two columns to the right, beginning with the numeric value in B, and only for rows where this nullnull pattern is observed.
Edit:
The output should look like this:
time A B C D E F
1 NaN null null 0.54 0.74 0.89
2 NaN null null 0.01 3.32 1.19
3 NaN null null 1.89 0.65 4.50
4 NaN null null 4.64 2.87 2.22
5 0.52 1.43 3.56 5.65 0.06 1.11
6 3.51 0.89 0.96 1.10 2.08 4.29
7 0.11 10.20 3.36 2.15 0.70 1.99
I used this code to read the csv-file:
df = pd.read_csv(r'path\to\file.csv',delimiter=';',names=['time','A','B','C','D','E','F'],index_col=False)
It is not due to the code I used to read the file; it is due to the export, which went wrong. I also get these nullnullxyz values in one column in the csv file itself.
Firstly, I would suggest fixing the corrupt csv file, or better the root cause of the corruption, before loading it into pandas.
If you really have to do it in pandas, here is a slow-and-dirty fix using .apply():
def fix(row: pd.Series) -> pd.Series:
    """Fix 'nullnull' assuming it occurs only in column B."""
    if str(row["B"]).startswith("nullnull"):
        return pd.Series([row["time"], row["A"], float('nan'), float('nan'),
                          float(row["B"][8:]), row["C"], row["D"]],
                         index=df.columns)
    else:  # no need to fix
        return row

# apply the fix for each row
df2 = df.apply(fix, axis=1)
# column B is in object type originally
df2["B"] = df2["B"].astype(float)
Output
print(df2)
time A B C D E F
0 1.0 NaN NaN NaN 0.54 0.74 0.89
1 2.0 NaN NaN NaN 0.01 3.32 1.19
2 3.0 NaN NaN NaN 1.89 0.65 4.50
3 4.0 NaN NaN NaN 4.64 2.87 2.22
4 5.0 0.52 1.43 3.56 5.65 0.06 1.11
5 6.0 3.51 0.89 0.96 1.10 2.08 4.29
6 7.0 0.11 10.20 3.36 2.15 0.70 1.99
Also verify data types:
print(df2.dtypes)
time float64
A float64
B float64
C float64
D float64
E float64
F float64
dtype: object
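For a large file, the same repair can be done without .apply, using a boolean mask and a few column assignments. This is a sketch under the same assumption that 'nullnull' only ever appears in column B, and, like the answer above, it puts NaN rather than the literal string 'null' into B and C:
import numpy as np
import pandas as pd

mask = df['B'].astype(str).str.startswith('nullnull')

fixed = df.copy()
# move the old C and D two columns to the right (into E and F) ...
fixed.loc[mask, 'F'] = df.loc[mask, 'D']
fixed.loc[mask, 'E'] = df.loc[mask, 'C']
# ... put the numeric tail of B into D, and blank out B and C
fixed.loc[mask, 'D'] = df.loc[mask, 'B'].str[len('nullnull'):].astype(float)
fixed.loc[mask, ['B', 'C']] = np.nan
fixed['B'] = fixed['B'].astype(float)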

If a value of a particular ID does not exist for another ID, insert a row with that value for that ID

I would like to insert a new row whenever a D1 value that exists for one ID is missing for another ID, with df['Value'] left blank (NaN) for the inserted rows. Your help is appreciated.
Input
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.02 2 4.5
0.04 2 4.1
0.08 2 3.6
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Expected output:
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.1 1
0.02 2 4.5
0.04 2 4.1
0.06 2
0.08 2 3.6
0.1 2
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Unfortunately, the code I have written has been way off or simply produces multiple error messages, so unlike my other questions I do not have an example to show.
Use unstack and stack. Chain an additional sort_index and reset_index to achieve the desired order:
df_final = (df.set_index(['D1', 'ID']).unstack().stack(dropna=False)
.sort_index(level=[1,0]).reset_index())
Out[952]:
D1 ID Value
0 0.02 1 1.2
1 0.04 1 1.6
2 0.06 1 1.9
3 0.08 1 2.8
4 0.10 1 NaN
5 0.02 2 4.5
6 0.04 2 4.1
7 0.06 2 NaN
8 0.08 2 3.6
9 0.10 2 NaN
10 0.02 3 2.7
11 0.04 3 2.9
12 0.06 3 2.4
13 0.08 3 2.1
14 0.10 3 1.9
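As an alternative sketch that produces the same result, the full D1 x ID grid can be built explicitly with pd.MultiIndex.from_product and the frame reindexed onto it, which avoids the unstack/stack round trip:
import pandas as pd

full_grid = pd.MultiIndex.from_product(
    [sorted(df['D1'].unique()), sorted(df['ID'].unique())],
    names=['D1', 'ID'])

df_final = (df.set_index(['D1', 'ID'])
              .reindex(full_grid)            # missing (D1, ID) pairs become NaN
              .sort_index(level=['ID', 'D1'])
              .reset_index())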

binning two dimensional data by its index in python

How would I bin some data based on the index of the data, in Python 3?
Let's say I have the following data:
1 0.5
3 0.6
5 0.7
6 0.8
8 0.9
10 1
11 1.1
12 1.2
14 1.3
15 1.4
17 1.5
18 1.6
19 1.7
20 1.8
22 1.9
24 2
25 2.1
28 2.2
31 2.3
35 2.4
How would I take this data and bin both columns such that each bin has n values in it, then average the numbers in each bin and output them?
For example, if I wanted to bin the values by 4,
I would take the first four data points:
1 0.5
3 0.6
5 0.7
6 0.8
and the averages of these would be: 3.75 0.65
I would continue down the columns by taking the next set of four, and so on
until I averaged all of the sets of four to get this:
3.75 0.65
10.25 1.05
16 1.45
21.25 1.85
29.75 2.25
How can I do this using Python?
Based on numpy reshape:
import numpy as np
import pandas as pd

pd.DataFrame([np.mean(x.reshape(len(df)//4, -1), axis=1) for x in df.values.T]).T
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25
You can "bin" the index into groups of 4 and call groupby in the index.
df.groupby(df.index // 4).mean()
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25
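If the two columns live in a plain text file rather than an existing DataFrame, an end-to-end sketch could look like this; the file name data.txt and the whitespace separator are assumptions, and np.arange(len(df)) // n is used so the grouping still works when the index is not the default RangeIndex:
import numpy as np
import pandas as pd

n = 4
# placeholder path; two whitespace-separated columns as shown above
df = pd.read_csv('data.txt', sep=r'\s+', header=None)

# positional bin labels: 0,0,0,0,1,1,1,1,...
# (if len(df) is not a multiple of n, the last bin averages the leftover rows)
bins = np.arange(len(df)) // n
print(df.groupby(bins).mean())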

How to rank values in different columns based on a criteria in Python (transferring Excel calculations to Python)?

I have a workbook in Excel with a bunch of calculations, but the file is getting too large and the calculations are taking too long to finish in Excel so I am trying to move the file to another program (preferably Python) to handle the calculations. I only have basic experience with Python, so I'm not even sure if this is the best software to handle this calculation.
Anyway, below is the table I am working with (this is a smaller version of the actual table; the actual table has over 35,000 rows).
State Item # Val1 Val2 Val3 Val4 Val5 Rank1 Rank2 Rank3 Rank4 Rank5 Count
CA 1 5.55 4.16 3.12 2.34 1.76 2 5 8 11 14 2
CA 2 6.43 4.82 3.62 2.71 2.03 1 3 6 9 12 2
CA 3 4.79 3.59 2.69 2.02 1.52 4 7 10 13 15 1
FL 4 10.41 7.81 5.86 4.39 3.29 1 3 5 7 9 3
FL 5 8.02 6.02 4.51 3.38 2.54 2 4 6 8 11 2
FL 6 3.22 2.42 1.81 1.36 1.02 10 12 13 14 15 0
NY 7 0.97 0.73 0.55 0.41 0.31 8 10 12 14 15 0
NY 8 1.44 1.08 0.81 0.61 0.46 6 7 9 11 13 0
NY 9 14.31 10.73 8.05 6.04 4.53 1 2 3 4 5 5
WA 10 9.31 6.98 5.24 3.93 2.95 1 3 5 7 9 3
WA 11 8.91 6.68 5.01 3.76 2.82 2 4 6 8 10 2
WA 12 1.55 1.16 0.87 0.65 0.49 11 12 13 14 15 0
The columns State, Item #, Val1, Val2, Val3, Val4, and Val5 are my input data. What I need to do is find the top 5 values within each state, and count how many of those top 5 values each item # has. I have done the calculations in Excel in the Rank1-Rank5 and Count columns. I'm wondering if this can be done in Python, and if so, how? I also want the code to be flexible enough to allow me to add more "Val" columns (it might go up to 10 values).
Thanks!
Usually when working with tabular data in Python, the pandas library is a good tool to reach for. There are lots of ways to do what you want, IIUC, but here's one which shouldn't be too hard to follow. It's mostly to give you a sense for the kinds of thing you can do. Starting from a DataFrame looking like yours:
>>> df
State Item # Val1 Val2 Val3 Val4 Val5
0 CA 1 5.55 4.16 3.12 2.34 1.76
1 CA 2 6.43 4.82 3.62 2.71 2.03
2 CA 3 4.79 3.59 2.69 2.02 1.52
3 FL 4 10.41 7.81 5.86 4.39 3.29
4 FL 5 8.02 6.02 4.51 3.38 2.54
5 FL 6 3.22 2.42 1.81 1.36 1.02
6 NY 7 0.97 0.73 0.55 0.41 0.31
7 NY 8 1.44 1.08 0.81 0.61 0.46
8 NY 9 14.31 10.73 8.05 6.04 4.53
9 WA 10 9.31 6.98 5.24 3.93 2.95
10 WA 11 8.91 6.68 5.01 3.76 2.82
11 WA 12 1.55 1.16 0.87 0.65 0.49
we can (1) turn it so the data is all vertical, (2) rank them so that low numbers are associated with the highest scores (with lots of options for how to handle ties; I'm ignoring those issues), (3) decide which ones we're interested in, and (4) count them by State/Item # combination. (In principle I guess an item could belong to more than one state, in which case we'd just drop the State from that last groupby).
df_m = pd.melt(df, id_vars=["State", "Item #"], var_name="Value")
df_m["rank"] = df_m.groupby("State")["value"].rank(ascending=False)
df_m["top"] = df_m["rank"] <= 5
df_m.groupby(["State", "Item #"], as_index=False)["top"].sum()
which finally produces
State Item # top
0 CA 1 2
1 CA 2 2
2 CA 3 1
3 FL 4 3
4 FL 5 2
5 FL 6 0
6 NY 7 0
7 NY 8 0
8 NY 9 5
9 WA 10 3
10 WA 11 2
11 WA 12 0
That's simply a melt (a kind of pivot operation); a groupby; a rank; a comparison; another groupby; and a sum (True == 1, so summing booleans is a count). Might be a bit scary for a complete beginner, but hopefully it'll encourage you to give pandas a try, because with just a bit of experience you can do a lot of operations quite efficiently.
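If you also want the count attached back to the original wide table (like the Count column in the Excel sheet), the grouped result can be merged back in. A short sketch continuing from the df and df_m built above:
# count of top-5 appearances per State / Item #, renamed to match the Excel column
counts = (df_m.groupby(["State", "Item #"], as_index=False)["top"]
              .sum()
              .rename(columns={"top": "Count"}))

# attach it to the original wide table
df_with_count = df.merge(counts, on=["State", "Item #"])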
Pandas is probably the best tool for this kind of task. There are many online tutorials and YouTube videos about it, including material by the original author himself.
