Using Pandas in Python to Join Multiple Files Based on Date - python

I have several CSV files that I need to join on date, but the files do not cover the same date ranges (some start on 1/1/1991, others in 1998). I have a basic start to the code (see below), but I am not sure where to go from here. Any tips are appreciated. Samples of the different CSV files I am trying to join follow the code.
import os, pandas as pd, glob
directory = r'C:\data\Monthly_Data'
files = os.listdir(directory)
print(files)
all_data = pd.DataFrame()
for f in glob.glob(directory):
    df = pd.read_csv(f)
    all_data = all_data.append(df, ignore_index=True)
all_data.describe()
File 1
DateTime F1_cfs F2_cfs F3_cfs F4_cfs F5_cfs F6_cfs F7_cfs
3/31/1991 0.860702028 1.167239264 0 0 0 0 0
4/30/1991 2.116930556 2.463493056 3.316688418
5/31/1991 4.056572581 4.544307796 5.562668011
6/30/1991 1.587513889 2.348215278 2.611659722
7/31/1991 0.55328629 1.089637097 1.132043011
8/31/1991 0.29702957 0.54186828 0.585073925 2.624375
9/30/1991 0.237083333 0.323902778 0.362583333 0.925563094 1.157786606 2.68722973 2.104090278
File 2
DateTime F1_mg-P_L F2_mg-P_L F3_mg-P_L F4_mg-P_L F5_mg-P_L F6_mg-P_L F7_mg-P_L
6/1/1992 0.05 0.05 0.06 0.04 0.03 0.18 0.08
7/1/1992 0.03 0.05 0.04 0.03 0.04 0.05 0.09
8/1/1992 0.02 0.03 0.02 0.02 0.02 0.02 0.02
File 3
DateTime F1_TSS_mgL F1_TVS_mgL F2_TSS_mgL F2_TVS_mgL F3_TSS_mgL F3_TVS_mgL F4_TSS_mgL F4_TVS_mgL F5_TSS_mgL F5_TVS_mgL F6_TSS_mgL F6_TVS_mgL F7_TSS_mgL F7_TVS_mgL
4/30/1991 10 7.285714286 8.5 6.083333333 3.7 3.1
5/31/1991 5.042553191 3.723404255 6.8 6.3 3.769230769 2.980769231
6/30/1991 5 5 1 1
7/31/1991
8/31/1991
9/30/1991 5.75 3.75 6.75 4.75 9.666666667 6.333333333 8.666666667 5 12 7.666666667 8 5.5 9 6.75
10/31/1991 14.33333333 9 14 10.66666667 16.25 11 12.75 9.25 10.25 7.25 29.33333333 18.33333333 13.66666667 9
11/30/1991 2.2 1.933333333 2 1.88 0 0 4.208333333 3.708333333 10.15151515 7.909090909 9.5 6.785714286 4.612903226 3.580645161

You aren't actually reading the CSV files.
1) You can remove the following lines, because their result is never used later in your code.
files = os.listdir(directory)
print(files)
2) glob.glob(directory) doesn't match any CSV files. glob.glob() takes a pattern as its argument, for example r'C:\data\Monthly_Data\File*.csv'. You passed the directory itself as the pattern, so no CSV files are found.
for f in glob.glob(directory):
After I changed these two parts and printed all_data, the file contents displayed on my console.
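Fixing the glob pattern only stacks the files on top of each other; it does not join them on date. Below is a minimal sketch of one way to do the join. It assumes the files are comma-separated, that each has a DateTime column in month/day/year form, and that each holds one row per month. That is why it indexes by monthly period, since File 1 uses month-end dates and File 2 uses the first of the month. The function name join_on_month and the in-memory sample values are made up for illustration. It also avoids DataFrame.append, which was removed in pandas 2.0.

```python
import io

import pandas as pd


def join_on_month(sources, date_col="DateTime"):
    """Outer-join several CSV sources side by side on the month of `date_col`.

    `sources` may be file paths or file-like objects (hypothetical helper).
    """
    frames = []
    for src in sources:
        df = pd.read_csv(src)
        # Dates in the samples look like month/day/year, e.g. 3/31/1991.
        dates = pd.to_datetime(df[date_col], format="%m/%d/%Y")
        df = df.drop(columns=date_col)
        # Assumption: one row per month, so 3/31/1991 and 3/1/1991 should match.
        df.index = pd.DatetimeIndex(dates).to_period("M").rename("Month")
        frames.append(df)
    # axis=1 places the files next to each other; months missing from a file become NaN.
    return pd.concat(frames, axis=1, join="outer").sort_index()


# Tiny in-memory stand-ins for two of the files (values are placeholders).
flow = io.StringIO("DateTime,F1_cfs\n3/31/1991,0.86\n4/30/1991,2.12\n")
phos = io.StringIO("DateTime,F1_mg-P_L\n4/1/1991,0.05\n5/1/1991,0.03\n")
all_data = join_on_month([flow, phos])
print(all_data)
```

For the real files you would pass something like sorted(glob.glob(os.path.join(directory, "*.csv"))) instead of the StringIO objects. If your files are tab- or space-delimited, read_csv needs a matching sep argument.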

Related

Create Tabular Dataset in Azure using python sdk

So I'm just starting with Azure and I have this problem:
Here is my code:
def getWorkspace(name):
    ws = Workspace.get(
        name=name,
        subscription_id=sid,
        resource_group='my_ressource',
        location='my_location')
    return ws
def uploadDataset(ws, file, separator=','):
    datastore = Datastore.get_default(ws)
    path = DataPath(datastore=datastore, path_on_datastore=file)
    dataset = TabularDatasetFactory.from_delimited_files(path=path, separator=separator)
    #dataset = Dataset.Tabular.from_delimited_files(path=path, separator=separator)
    print(dataset.to_pandas_dataframe().head())
    print(type(dataset))
ws = getWorkspace(workspace_name)
uploadDataset(ws, my_csv, ";")
#result :
   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  ...  density    pH  sulphates  alcohol  quality
0            7.5              0.33         0.32            11.1      0.036  ...  0.99620  3.15       0.34     10.5        6
1            6.3              0.27         0.29            12.2      0.044  ...  0.99782  3.14       0.40      8.8        6
2            7.0              0.30         0.51            13.6      0.050  ...  0.99760  3.07       0.52      9.6        7
3            7.4              0.38         0.27             7.5      0.041  ...  0.99535  3.17       0.43     10.0        5
4            8.1              0.12         0.38             0.9      0.034  ...  0.99026  2.80       0.55     12.0        6
[5 rows x 12 columns]
<class 'azureml.data.tabular_dataset.TabularDataset'>
But when I go to Microsoft Azure Machine Learning Studio in datasets, this dataset isn't created.
What am I doing wrong?
First check the format of the file. For .csv or .tsv files, use the from_delimited_files() method of the TabularDatasetFactory class; for .parquet files there is from_parquet_files(). Note that these methods only create the dataset object; it does not appear in the studio until it is registered. There is also a register_pandas_dataframe() method, which registers a TabularDataset to the workspace and uploads the data to your underlying storage.
If the storage is behind a virtual network or firewall, pass validate=False to from_delimited_files() to skip the validation step.
Specify the datastore name along with the workspace as below:
datastore_name = 'your datastore name'
workspace = Workspace.from_config() #if we have existing work space.
datastore = Datastore.get(workspace, datastore_name)
Below is the way to create a TabularDataset from 3 file paths.
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]
Create_TBDS = Dataset.Tabular.from_delimited_files(path=datastore_paths)
If we want to specify the separator, we can do it as below:
Create_TBDS = Dataset.Tabular.from_delimited_files(path=datastore_paths, separator=',')
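None of the calls above make the dataset show up in the studio by themselves. As far as I know, in the v1 SDK (azureml-core) that happens when you call register() on the TabularDataset. Here is a small sketch of that step. The helper name register_tabular_dataset and the dataset name in the usage note are made up; the helper does not import azureml and simply assumes you pass it a TabularDataset and a Workspace you have already created.

```python
def register_tabular_dataset(dataset, workspace, name):
    """Register `dataset` in `workspace` under `name` (hypothetical helper).

    Assumes `dataset` is an azureml TabularDataset (SDK v1) and `workspace`
    an azureml Workspace; neither object is constructed here.
    """
    # register() saves the dataset definition in the workspace so it is
    # listed under Datasets; create_new_version=True adds a new version
    # when a dataset with this name is already registered.
    return dataset.register(
        workspace=workspace,
        name=name,
        create_new_version=True,
    )
```

In the question's uploadDataset function this would be called as register_tabular_dataset(dataset, ws, "wine_quality") after from_delimited_files(), where "wine_quality" is just an example name.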

Load files with two indexes in one dataframe

How can I load 28 files, each with the same number of rows and columns, so that the row index does not run through all the data (0-2911) but restarts for each file (0-103), with a second index (1-28) identifying the file each row came from?
Here is the code I wrote, which indexes through all the data:
import pandas as pd
import glob
path = r"C:/Users/Measurment_Data/Test_1"
all_files = glob.glob(path + "/*.dat")
li = []
for filename in all_files:
    df = pd.read_csv(filename, sep="\t", names=["Voltage", "Current"], header=None)
    li.append(df)
frame = pd.concat(li, axis = 0, ignore_index = True)
frame
Output:
ID Voltage Current
0 NaN 1.000000e+00
1 0.00 -3.047149e-06
2 0.04 -4.941096e-06
3 0.08 -4.472754e-06
4 0.12 -1.053477e-05
... ... ...
2907 -0.16 1.194359e-06
2908 -0.12 5.489425e-06
2909 -0.08 -9.656614e-09
2910 -0.04 -3.427169e-06
2911 -0.00 -2.173696e-06
I would like to have new indexes for every new loaded file. Something like this:
File ID Curr Volt
1 0 0.00 1.00E+00
1 1 0.00 -3.05E-06
1 2 0.04 -4.94E-06
...
1 102 0.08 -4.47E-06
1 103 0.12 -1.05E-05
...
2 0 0.00 2.00E+00
2 1 4.00 -3.05E-06
2 2 0.44 -3.94E-06
...
2 102 5.08 -6.47E-06
2 103 0.22 -6.05E-05
...
...
27 0 0.00 2.00E+00
27 1 4.00 -3.05E-06
27 2 0.44 -3.94E-06
...
27 102 5.08 -6.47E-06
27 103 0.22 -6.05E-05
...
28 0 0.00 2.00E+00
28 1 4.00 -3.05E-06
28 2 0.44 -3.94E-06
...
28 102 5.08 -6.47E-06
28 103 0.22 -6.05E-05
I would like to easily access the values of every file with index, so for example all values from 0-5 from 28 files.
Just add a new column after you read each file, then concatenate with the default value of ignore_index (False), so each file keeps its own 0-103 index:
import pandas as pd
import glob
path = r"C:/Users/Measurment_Data/Test_1"
all_files = glob.glob(path + "/*.dat")
li = []
j = 1
for filename in all_files:
    df = pd.read_csv(filename, sep="\t", names=["Voltage", "Current"], header=None)
    df.insert(0, 'File', j)  # file number as the first column
    j += 1
    li.append(df)
frame = pd.concat(li, axis = 0)
frame
Give it a try!
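If you want the file number as a real second index level rather than a column, pd.concat can build it through its keys argument. This is only a sketch on small made-up frames (3 files with 3 rows each instead of 28 with 104), and the function name stack_with_file_index is hypothetical.

```python
import pandas as pd


def stack_with_file_index(frames):
    """Concatenate frames under a two-level index (File, ID); File starts at 1."""
    keys = list(range(1, len(frames) + 1))
    return pd.concat(frames, keys=keys, names=["File", "ID"])


# Made-up stand-ins for the frames that read_csv would return.
frames = [
    pd.DataFrame({"Voltage": [0.00, 0.04, 0.08], "Current": [1.0, 2.0, 3.0]})
    for _ in range(3)
]
frame = stack_with_file_index(frames)

# Rows 0-1 of every file (label slices are inclusive on both ends).
first_rows = frame.loc[pd.IndexSlice[:, 0:1], :]

# All rows of file 2, indexed 0..2 again.
file_two = frame.loc[2]
print(first_rows)
```

With the real data, frame.loc[pd.IndexSlice[:, 0:5], :] would give rows 0-5 from all 28 files.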

Remove Columns from DataFrame based on Standard Deviation

I am trying to do something that I think should be rather simple but I am stuck.
I would like to be able to get the standard deviation of each column in my dataframe and remove that column if the standard deviation is below a set number. This is as far as I have gotten.
stdev_min = 0.6
df = pd.DataFrame(np.random.randn(20, 5), columns=list('ABCDE'))
namelist = list(df.columns.values.tolist())
stdev = pd.DataFrame(df.std())
I've tried a few things but nothing worth mentioning, any help would be greatly appreciated.
You don't need any loops.
You rarely do with pandas.
In this case, you need boolean indexing:
import pandas
import numpy
numpy.random.seed(37)
stdev_min = 0.95
df = pandas.DataFrame(numpy.random.randn(20, 5), columns=list('ABCDE'))
So now df.std() gives me:
A 0.928547
B 0.859394
C 0.998692
D 1.187380
E 1.092970
dtype: float64
so I can do
df.loc[:, df.std() > stdev_min]
And get:
C D E
0 0.35 -1.30 1.52
1 -0.45 0.96 -0.83
2 0.52 -0.06 -0.03
3 1.89 0.40 0.19
4 -0.27 -2.07 -0.71
5 -1.72 -0.40 1.27
6 0.44 -2.05 -0.23
7 1.76 0.06 0.36
8 -0.30 -2.05 1.68
9 0.34 1.26 -1.08
10 0.10 -0.48 -1.74
11 1.95 -0.08 1.51
12 0.43 -0.06 -0.63
13 -0.30 -1.06 0.57
14 -0.95 -1.45 0.93
15 -1.13 2.23 -0.88
16 -0.77 0.86 0.58
17 0.93 -0.11 -1.29
18 -0.82 0.03 -0.44
19 0.40 1.13 -1.89
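The same boolean mask can be wrapped in a small function if you need it in several places. This is a sketch with a made-up function name and a tiny deterministic frame instead of random data, and it assumes every column is numeric. It returns a new frame and leaves the original untouched.

```python
import pandas as pd


def drop_low_std_columns(df, stdev_min):
    """Return only the columns whose standard deviation exceeds stdev_min."""
    keep = df.std() > stdev_min  # boolean Series indexed by column name
    return df.loc[:, keep]


# Column A barely varies, column B varies a lot.
df = pd.DataFrame({
    "A": [1.0, 1.1, 0.9, 1.0],
    "B": [0.0, 5.0, -5.0, 10.0],
})
result = drop_low_std_columns(df, stdev_min=0.6)
print(result.columns.tolist())
```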
Here's a way to do this.
Iterate through the columns, compute the standard deviation of each one, and check whether it is below the minimum. If it is, drop the column in place with inplace=True.
import numpy as np
import pandas as pd
stdev_min = 0.6
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
for col in df.columns:
    print(col, df[col].std())
    if df[col].std() < stdev_min:
        df.drop(col, axis='columns', inplace=True)
print(df)
Output:
A 0.5046725928657507
B 1.1382221163449697
C 1.0318169576864502
D 0.7129102193331575
E 1.3805207184389312
The standard deviation of A is less than 0.6, so it got dropped.
B C D E
0 -0.923822 1.155547 -0.601033 -0.066207
1 0.068844 0.426304 -0.376052 0.368574
2 0.585187 -0.367270 0.530934 0.086811
3 0.021466 1.381579 0.483134 -0.300033
4 0.351492 -0.648734 -0.736213 0.827953
5 0.155731 -0.004504 0.315432 0.310515
6 -1.092933 1.341933 -0.672240 -3.482960
7 -0.587766 0.227846 0.246781 1.978528
8 1.565055 0.527668 -0.371854 -0.030196
9 -2.634862 -1.973874 1.508080 -0.362073
Did a few more runs. Here's an example with before and after.
DF before
A B C D E
0 0.496740 0.799021 1.655287 0.091138 0.309186
1 -0.580667 -0.749337 -0.521909 -0.529410 1.010981
2 0.212731 0.126389 -2.244500 0.400540 -0.148761
3 -0.424375 -0.832478 -0.030865 -0.561107 0.196268
4 0.229766 0.688040 0.580294 0.941885 1.554929
5 0.676926 -0.062092 -1.452619 0.952388 -0.963857
6 0.683216 0.747429 -1.834337 -0.402467 -0.383881
7 0.834815 -0.770804 1.299346 1.694612 1.171190
8 0.500445 -1.517488 0.610287 -0.601442 0.343389
9 -0.182286 -0.713332 0.526507 1.042717 1.229628
Standard Deviations for each column of DF:
A 0.49088743174291477
B 0.8047513692231202
C 1.333382184686379
D 0.8248456756163864
E 0.8033725216710547
The standard deviation of df['A'] is less than 0.6, so it got dropped.
DF after dropping the column.
B C D E
0 0.799021 1.655287 0.091138 0.309186
1 -0.749337 -0.521909 -0.529410 1.010981
2 0.126389 -2.244500 0.400540 -0.148761
3 -0.832478 -0.030865 -0.561107 0.196268
4 0.688040 0.580294 0.941885 1.554929
5 -0.062092 -1.452619 0.952388 -0.963857
6 0.747429 -1.834337 -0.402467 -0.383881
7 -0.770804 1.299346 1.694612 1.171190
8 -1.517488 0.610287 -0.601442 0.343389
9 -0.713332 0.526507 1.042717 1.229628

Extract data out of a csv file

I am trying to extract data out of a csv file and output the data to another csv file.
relate task perform
0 avc asd
1 12 24
2 34 54
3 22 33
4 11 11
5 335 534
Time A B C D
0 0.334 0.334 0.334 0.334
1 0.543 0.543 0.543 0.543
2 0.752 0.752 0.752 0.752
3 0.961 0.961 0.961 0.961
4 1.17 1.17 1.17 1.17
5 1.379 1.379 1.379 1.379
I am writing a python script to read the above table. I want all the data from Time, A, B,C, and D onwards in a separate file.
import csv
import pandas as pd
import os
read_file = False
with open('xyz.csv', mode='r', encoding='utf-8') as f_read:
    reader = csv.reader(f_read)
    for row in reader:
        if 'Time' in row:
            read_file = True  # header row of the second table found
I am stuck here. As I understand it, 'reader' yields every row of the file. How can I extract the data from the line containing Time onwards into a separate file?
Is there a better method to achieve the above objective?
Should I use pandas instead of regular python commands?
I read many similar answers on stackoverflow but I am confused on how to finish this problem. Your help is appreciated.
Best
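One possible approach with just the csv module is to keep a flag that flips when the header row containing Time is seen, and to write every row from that point on. This is a sketch only: the function name copy_from_marker is made up, it works on open file objects, and the sample text is a comma-separated guess at what xyz.csv looks like.

```python
import csv
import io


def copy_from_marker(src, dst, marker="Time"):
    """Copy rows from src to dst, starting at the first row that has a
    cell equal to `marker`. Returns True if the marker row was found."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    found = False
    for row in reader:
        if not found and marker in row:
            found = True  # this row is the header of the second table
        if found:
            writer.writerow(row)
    return found


# In-memory stand-in for xyz.csv (the layout is an assumption).
sample = io.StringIO(
    "relate,task,perform\n"
    "avc,asd\n"
    "12,24\n"
    "Time,A,B,C,D\n"
    "0.334,0.334,0.334,0.334\n"
    "0.543,0.543,0.543,0.543\n"
)
output = io.StringIO()
found = copy_from_marker(sample, output)
print(output.getvalue())
```

With real files, open the input and output with open(..., newline='') and pass those file objects in place of the StringIO objects.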

How do I get only those lines that has highest value if they are inside a timewindow?

I am new to Python and scripting in general, so I would really appreciate some guidance in writing a Python script.
So, to the point:
I have a large number of files in a directory. Some files are empty, others contain rows like this:
16 2009-09-30T20:07:59.659Z 0.05 0.27 13.559 6
16 2009-09-30T20:08:49.409Z 0.22 0.312 15.691 7
16 2009-09-30T20:12:17.409Z -0.09 0.235 11.826 4
16 2009-09-30T20:12:51.159Z 0.15 0.249 12.513 6
16 2009-09-30T20:15:57.209Z 0.16 0.234 11.776 4
16 2009-09-30T20:21:17.109Z 0.38 0.303 15.201 6
16 2009-09-30T20:23:47.959Z 0.07 0.259 13.008 5
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5
16 2009-09-30T20:37:48.609Z -0.02 0.256 12.861 4
16 2009-09-30T20:44:19.359Z 0.14 0.251 12.597 4
16 2009-09-30T20:48:39.759Z 0.03 0.284 14.244 5
16 2009-09-30T20:49:36.159Z -0.07 0.278 13.98 4
16 2009-09-30T20:57:54.609Z 0.01 0.304 15.294 4
16 2009-09-30T20:59:47.759Z 0.27 0.265 13.333 4
16 2009-09-30T21:02:56.209Z 0.28 0.272 13.645 6
and so on.
I want to copy these lines from the files into a new file, but with a condition:
If two or more successive lines fall inside a time window of 6 seconds, only the line with the highest threshold should be written to the new file.
So, something like this:
Original:
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5
in output file:
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
Keep in mind that lines from different files can fall inside the same 6 s window, so the line written to the output is the one with the highest threshold across all files.
The code that explains what is what in the lines is here:
import glob
from datetime import datetime
path = './*.cat'
files=glob.glob(path)
for file in files:
    in_file=open(file, 'r')
    out_file = open("times_final", "w")
    for line in in_file.readlines():
        split_line = line.strip().split(' ')
        template_number = split_line[0]
        t = datetime.strptime(split_line[1], '%Y-%m-%dT%H:%M:%S.%fZ')
        mag = split_line[2]
        num = split_line[3]
        threshold = float(split_line[4])
        no_detections = split_line[5]
    in_file.close()
    out_file.close()
Thank you very much for hints, guidelines, ...
You said in the comments that you know how to merge multiple files into one sorted by t, and that the 6-second windows start with the first row and are based on the actual data.
So you need a way to remember the maximum threshold per window and to write a row only once you are sure you have processed all rows in its window. Sample implementation:
from datetime import datetime, timedelta
from csv import DictReader, DictWriter
fieldnames=("template_number", "t", "mag","num", "threshold", "no_detections")
with open('master_data') as f_in, open("times_final", "w") as f_out:
    reader = DictReader(f_in, delimiter=" ", fieldnames=fieldnames)
    writer = DictWriter(f_out, delimiter=" ", fieldnames=fieldnames,
                        lineterminator="\n")
    window_start = datetime(1900, 1, 1)
    window_timedelta = timedelta(seconds=6)
    window_max = 0
    window_row = None
    for row in reader:
        try:
            t = datetime.strptime(row["t"], "%Y-%m-%dT%H:%M:%S.%fZ")
            threshold = float(row["threshold"])
        except ValueError:
            # replace by actual error handling
            print("Problem with: {}".format(row))
            continue  # skip rows that cannot be parsed
        # switch to new window after 6 seconds
        if t - window_start > window_timedelta:
            # write out previous window before switching
            if window_row:
                writer.writerow(window_row)
            window_start = t
            window_max = threshold
            window_row = row
        # remember max threshold inside a single window
        elif threshold > window_max:
            window_max = threshold
            window_row = row
    # don't forget the last window
    if window_row:
        writer.writerow(window_row)
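For completeness, here is one way the merge step that produces master_data could look. It is a sketch under two assumptions: each input file is already sorted by time on its own, and every timestamp uses the same fixed-width format, so comparing the timestamp strings gives chronological order. The function names and the sample lines (including template number 17) are made up.

```python
import heapq


def timestamp_of(line):
    """Return the second whitespace-separated field, the ISO timestamp."""
    return line.split()[1]


def merge_by_time(*line_iterables):
    """Lazily merge several time-sorted iterables of lines into one stream.

    Blank lines are dropped, so empty files contribute nothing.
    """
    cleaned = [
        (line for line in lines if line.strip())
        for lines in line_iterables
    ]
    return heapq.merge(*cleaned, key=timestamp_of)


# Made-up stand-ins for the contents of two .cat files.
file_a = [
    "16 2009-09-30T20:07:59.659Z 0.05 0.27 13.559 6\n",
    "16 2009-09-30T20:12:17.409Z -0.09 0.235 11.826 4\n",
]
file_b = [
    "\n",
    "17 2009-09-30T20:08:49.409Z 0.22 0.312 15.691 7\n",
]
merged = list(merge_by_time(file_a, file_b))
print("".join(merged), end="")
```

Open file objects can be passed in directly, since iterating over a file yields its lines. The merged lines can then be written to master_data and fed to the windowing code above.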
