How to Open a Text File and Create an Array in Python

I have a text file called Orbit 1 and I need help opening it and then creating three separate arrays. I'm new to Python and have been having difficulty with this aspect. Here are the first few rows of my text file. There are 1112 rows including the header.
Year Month Day Hour Minute Second Millisecond Longitude Latitude Altitude
2019 3 17 5 55 55 0 108.8730074 50.22483151 412.6226898
2019 3 17 5 56 0 0 108.9895097 50.53642185 412.7368197
2019 3 17 5 56 5 0 109.1078294 50.8478274 412.850563
2019 3 17 5 56 10 0 109.2280101 51.15904424 412.9640113
2019 3 17 5 56 15 0 109.3500969 51.47006828 413.0772319
2019 3 17 5 56 20 0 109.4741362 51.78089533 413.1901358
2019 3 17 5 56 25 0 109.6001758 52.09152105 413.3025291
2019 3 17 5 56 30 0 109.728265 52.40194099 413.414457
2019 3 17 5 56 35 0 109.8584548 52.71215052 413.5259984
2019 3 17 5 56 40 0 109.9907976 53.02214489 413.6371791
I want to open this text file and create three arrays called lat[N], long[N], and time[N], where N is the number of data rows in the file. I ultimately want to be able to look up the latitude, longitude, and time at any index. For example, lat[0] should return 50.22483151 if everything is working properly. For the time, I would need to convert to decimal hours before building the array; for instance, the first row's 5:55:55 becomes 5 + 55/60 + 55/3600 ≈ 5.932 hours.
Essentially, I need help opening this text file and then creating the three arrays.
I've tried the following for opening the file, but I get stuck when trying to build the arrays, and I think I may not be opening the file correctly.
import numpy as np
file_name = 'C:\\Users\\Saman\\OneDrive\\Documents\\Orbit 1.txt'
data = []
with open(file_name) as file:
    next(file)  # skip the header row
    for line in file:
        row = line.split()
        row = [float(x) for x in row]
        data.append(row)

The easiest way to solve your problem is to use pandas:
import pandas as pd
df = pd.read_table('Orbit 1.txt', sep=r'\s+')
df['Longitude']
#0 108.873007
#1 108.989510
#2 109.107829
#3 109.228010
#4 109.350097
#5 109.474136
#6 109.600176
#7 109.728265
#8 109.858455
#9 109.990798
Once you get a Pandas DataFrame, you may want to use it for the rest of the data processing, too.
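For the three arrays in the question, a minimal sketch building on this DataFrame (column names come from the file's header; time is converted to decimal hours as hours + minutes/60 + seconds/3600, and the Millisecond column is all zeros in the sample so it is ignored here):
lat = df['Latitude'].to_numpy()
long = df['Longitude'].to_numpy()
time = (df['Hour'] + df['Minute'] / 60 + df['Second'] / 3600).to_numpy()
lat[0]   # 50.22483151
time[0]  # roughly 5.9319 for the first row (5:55:55)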

file_name = 'info.txt'
Lat = []
Long = []
Time = []
left_justified = lambda x: x + " " * (19 - len(x))
right_justified = lambda x: " " * (19 - len(x)) + x
with open(file_name) as file:
    next(file)  # skip the header row
    for line in file:
        data = line.split()
        Lat.append(data[8])
        Long.append(data[7])
        hrs = int(data[3])
        minutes = int(data[4])
        secs = int(data[5])
        total_secs = secs + minutes * 60 + hrs * 3600
        Time.append(total_secs / 3600)  # decimal hours
print(left_justified("Time"), left_justified("Lat"), left_justified("Long"))
for i in range(len(Lat)):
    print(left_justified(str(Time[i])), left_justified(Lat[i]), left_justified(Long[i]))
Try this
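If you prefer NumPy arrays over the plain lists built above (matching the lat[N], long[N], time[N] naming from the question), the lists can be converted once the loop finishes; a small sketch, assuming the code above has already run:
import numpy as np
lat = np.array(Lat, dtype=float)   # Lat holds strings, so convert to float
long = np.array(Long, dtype=float)
time = np.array(Time)              # already in decimal hours
lat[0]   # 50.22483151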

Related

Data Extraction from multiple excel files in pandas dataframe

I'm trying to create a data ingestion routine to load data from multiple Excel files, each with multiple tabs and columns, into pandas DataFrames. The structure of the tabs is the same in every Excel file, and each tab should become a separate DataFrame. As of now, I have created a list of DataFrames, one per Excel file, where each holds the data from all of that file's tabs concatenated together. But I'm trying to find a way to access each Excel file from a data structure, and each tab of that file as a separate DataFrame. Below is the current code. Any improvement would be appreciated! Please let me know if anything else is needed.
# Assigning the path to the folder variable
folder = 'specified_path'
# Getting the list of files from the assigned path
excel_files = [file for file in os.listdir(folder)]
list_of_dfs = []
for file in excel_files:
    df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None), ignore_index=True)
    df['excelfile_name'] = file.split('.')[0]
    list_of_dfs.append(df)
I would propose to change the line
df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None), ignore_index=True)
to
df = pd.concat(pd.read_excel(folder + "\\" + file, sheet_name=None))
df.index = df.index.get_level_values(0)
df = df.reset_index().rename({'index': 'Tab'}, axis=1)
To create a separate dataframe for each tab (with duplicated content) in an Excel file, one could iterate over index level 0 values and index with it:
df = pd.concat(pd.read_excel(filename, sheet_name=None))
list_of_dfs = []
for tab in df.index.get_level_values(0).unique():
    tab_df = df.loc[tab]
    list_of_dfs.append(tab_df)
For illustration, after reading an Excel file with 3 tabs and running the above code, here is the content of list_of_dfs:
[ Date Reviewed Adjusted
0 2022-07-11 43 20
1 2022-07-18 16 8
2 2022-07-25 8 3
3 2022-08-01 17 3
4 2022-08-15 14 6
5 2022-08-22 12 5
6 2022-08-29 8 4,
Date Reviewed Adjusted
0 2022-07-11 43 20
1 2022-07-18 16 8
2 2022-07-25 8 3
3 2022-08-01 17 3
4 2022-08-15 14 6
5 2022-08-22 12 5
6 2022-08-29 8 4,
Date Reviewed Adjusted
0 2022-07-11 43 20
1 2022-07-18 16 8
2 2022-07-25 8 3
3 2022-08-01 17 3
4 2022-08-15 14 6
5 2022-08-22 12 5
6 2022-08-29 8 4]
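As a side note, pd.read_excel with sheet_name=None already returns a dict of DataFrames keyed by tab name, so if the concatenated view is not needed you can take each tab directly; a minimal sketch:
sheets = pd.read_excel(filename, sheet_name=None)   # {tab_name: DataFrame, ...}
list_of_dfs = list(sheets.values())                 # one DataFrame per tab
# a specific tab can be accessed by name, e.g. sheets['Sheet1'] (tab name assumed here)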

Creating Multiple DataFrames from single DataFrame based on different values of single column

I have 3 days of time series data with multiple columns, all in one single DataFrame. I want 3 different DataFrames split on the column "Dates", i.e. df["Dates"].
For Example:
Available Dataframe is: df
Expected Output: Based on Three different Dates
First DataFrame: df_23
Second DataFrame: df_24
Third DataFrame: df_25
I want to use these all three DataFrames separately for analysis.
I tried the code below, but I am not able to use those three DataFrames (rather, I don't know how to use them). Can anybody help me make my code work better? Thank you.
The above code just prints the DataFrame as three DataFrames, and not as expected per the code either!
Unsure whether you're saving your variable into a csv or keeping it in memory for further use, but you could pass each unique value into a dict and access it by key:
print(df)
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
4 54 24
5 10 24
6 77 24
7 95 24
8 58 25
9 53 25
10 44 25
11 94 25
d = {}
for frame, data in df.groupby('Dates'):
    d[f'df{frame}'] = data
print(d['df23'])
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
Edit, for the updated request:
for k, v in d.items():
    i = v['Cal'].loc[v['Cal'] > 70].count()
    print(f"{v['Dates'].unique()[0]} --> {i} times")
23 --> 4 times
24 --> 2 times
25 --> 1 times
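The same dict can also be built in a single line with a dict comprehension, which does exactly what the loop above does:
d = {f'df{dates}': frame for dates, frame in df.groupby('Dates')}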

How to add a new column without opening a csv file

I scraped data and exported it as csv files.
For simplicity, the data look like below
(I intentionally put arbitrary variables to just illustrate an example):
id var1 var2 var3 ...
A 10 14 355 ...
B 35 56 22 ...
C 95 22 222 ...
D 44 55 222 ...
Since I collected the data daily, I saved my file name as city_20180814_result.csv
For example, if I collected the data in NYC at Aug 14th 2018, the corresponding file name is NYC_20180814_result.csv
Here, I want to add a new column, the date variable, into each csv file.
The desired output is shown below. To be specific, I want to add a date column (in YYYYMMDD format) to each csv file, where the values are the date the data were collected. For example, if the csv file above was generated on Aug 14th 2018, the updated data will look like this:
id date var1 var2 var3 ...
A 20180814 10 14 355 ...
B 20180814 35 56 22 ...
C 20180814 95 22 222 ...
D 20180814 44 55 222 ...
The conventional way to do this is to open every csv file, manually add a new column, assign the corresponding date to all the rows, and repeat for every csv file. But there are too many files to do this by hand. Is there any way to do this efficiently? Since I saved the file names including the date, it would be good to use this if possible. Any help/code (using Python again, or an Excel macro) would be appreciated.
My solution using python's pandas package:
import os
import re
import pandas as pd

FILE_PATTERN = re.compile(r'(.*)_(\d{8})_result.csv')

def addDate(file_dir):
    # find every csv in the folder whose name matches city_YYYYMMDD_result.csv
    csv_list = [csvfile for csvfile in os.listdir(file_dir) if re.fullmatch(FILE_PATTERN, csvfile)]
    for csvname in csv_list:
        date = re.fullmatch(FILE_PATTERN, csvname).group(2)  # the YYYYMMDD part of the name
        df = pd.read_csv(os.path.join(file_dir, csvname))
        df.insert(loc=1, column='date', value=[date] * len(df))
        df.to_csv(os.path.join(file_dir, csvname), index=False)
Sample input: NYC_20180814_result.csv in some_path:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
Same csv after executing addDate(some_path):
A date B C
0 0 20180814 1 2
1 3 20180814 4 5
2 6 20180814 7 8
P.S. You'll not see the index column in your csv file.

Python processing CSV file really slow

So I am trying to open a CSV file, read its fields, fix some other fields based on them, and then save the data back to csv. My problem is that the CSV file has 2 million rows. What would be the best way to speed this up?
The CSV file consists of
ID; DATE(d/m/y); SPECIAL_ID; DAY; MONTH; YEAR
I am counting how often a row with the same date appears on my record and then update SPECIAL_ID based on that data.
Based on my previous research I decided to use pandas. I'll be processing even bigger sets of data in the future (1-2 GB); this one is around 119 MB, so it is crucial that I find a good, fast solution.
My code goes as follows:
df = pd.read_csv(filename, delimiter=';')
df_fixed = pd.DataFrame(columns=stolpci)  # when I process a row in df I append it to df_fixed
d = 31
m = 12
y = 100
s = (y, m, d)
list_dates = np.zeros(s)  # 3-dimensional array
for index, row in df.iterrows():
    # PROCESSING LOGIC GOES HERE
    # IT CONSISTS OF A FEW IF STATEMENTS
    list_dates[row.DAY][row.MONTH][row.YEAR] += 1
    row['special_id'] = list_dates[row.DAY][row.MONTH][row.YEAR]
    df_fixed = df_fixed.append(row.to_frame().T)
df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8')
I tried printing a progress message every thousand rows processed. At first, my script needs 3 seconds for 1000 rows, but the longer it runs the slower it gets;
at row 43,000 it needs 29 seconds, and so on...
Thanks for all future help :)
EDIT:
I am adding additional information about my CSV and the expected output.
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505__-;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505__-;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505__-;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505__-;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505__-;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505__-;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505__-;F;6;1001001;1001001_F_6;16;8;2011
I have to replace the __- in the SPECIAL_ID field with a proper number.
For example, for the row with ID = 2 the SPECIAL_ID will be 13012016505001 (__- got replaced by 001); if someone else in the CSV shares the same DAY, MONTH, and YEAR, their __- will be replaced by 002, and so on...
So the expected output for the above rows would be
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505001;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505001;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505001;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505001;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505001;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505001;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505002;F;6;1001001;1001001_F_6;16;8;2011
EDIT:
I changed my code to something like this: I fill a list of dicts with data, then convert that list to a DataFrame and save it as csv. This takes around 30 minutes to complete.
list_popravljeni = []
df = pd.read_csv(filename, delimiter=';')
df_dates = df.groupby(by=['dan_roj', 'mesec_roj', 'leto_roj']).size().reset_index()
for index, row in df_dates.iterrows():
    dan_roj, mesec_roj, leto_roj = row['dan_roj'], row['mesec_roj'], row['leto_roj']
    df_candidates = df.loc[(df['dan_roj'] == dan_roj) & (df['mesec_roj'] == mesec_roj) & (df['leto_roj'] == leto_roj)]
    for index, row in df_candidates.iterrows():
        vrstica = {}
        vrstica['ID'] = row['identifikator']
        vrstica['SPECIAL_ID'] = row['emso'][0:11] + str(index).zfill(2)
        vrstica['day'] = row['day']
        vrstica['MONTH'] = row['MONTH']
        vrstica['YEAR'] = row['YEAR']
        list_popravljeni.append(vrstica)
pd.DataFrame(list_popravljeni, columns=list_popravljeni[0].keys())
I think this gives what you're looking for and avoids looping. Potentially it could be more efficient (I wasn't able to find a way to avoid creating counts). However, it should be much faster than your current approach.
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])
I added a fake record at the end to confirm it does increment properly:
SPECIAL_ID sex age zone key day month year counts
0 13012016505001 F 1 1001001 1001001_F_1 13 1 2016 001
1 25122013505001 F 4 1001001 1001001_F_4 25 12 2013 001
2 24022012505001 F 5 1001001 1001001_F_5 24 2 2012 001
3 09032012505001 F 5 1001001 1001001_F_5 9 3 2012 001
4 21082011505001 F 6 1001001 1001001_F_6 21 8 2011 001
5 16082011505001 F 6 1001001 1001001_F_6 16 8 2011 001
6 21102011505002 F 6 1001001 1001001_F_6 16 8 2011 002
7 21102012505003 F 6 1001001 1001001_F_6 16 8 2011 003
If you want to get rid of counts, you just need:
df.drop('counts', inplace=True, axis=1)
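Since the question ultimately writes the result back to a csv, the fixed frame can then be saved the same way the original code did (filename_fixed as in the question):
df.to_csv(filename_fixed, sep=';', encoding='utf-8', index=False)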

Matching 'Date' dataframes in Pandas to enable joins/merging

I have two csv files, read into pandas DataFrames, each with a 'Date' column that is my desired key for joining the two tables (my goal is to join my two csvs by date and merge matching rows by summing them).
The issue is that although both share a month-year format, my first csv abbreviates the years, whereas my desired output uses the month with a four-digit year (for example, Aug-2012 as opposed to Aug-12).
csv1:
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
...
has 41 rows, i.e. 41 months' worth of data between Oct. 2012 and Feb. 2016
csv2:
0 Jan-2009 943690
1 Feb-2009 1062565
2 Mar-2009 210079
3 Apr-2009 -735286
4 May-2009 842933
5 Jun-2009 358691
6 Jul-2009 914953
7 Aug-2009 723427
8 Sep-2009 -837468
...
has 86 rows, i.e. 86 months' worth of data between Jan. 2009 and Feb. 2016
I tried initially to do something akin to a 'find and replace' function as one would in Excel.
I tried:
findlist = ['12', '13', '14', '15', '16']
replacelist = ['2012', '2013', '2014', '2015', '2016']

def findReplace(find, replace):
    s = csv1_df.read()
    s = s.replace(Date, replacement)
    csv1_dfc.write(s)

for item, replacement in zip(findlist, replacelist):
    s = s.replace(Date, replacement)
But I am getting a
NameError: name 's' is not defined
You can use to_datetime to transform to datetime format, and then strftime to adjust your format:
df['col_date'] = pd.to_datetime(df['col_date'], format="%b-%y").dt.strftime('%b-%Y')
Input:
col_date val
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
Output:
col_date val
0 Oct-2012 1154293
1 Nov-2012 885773
2 Dec-2012 -448704
3 Jan-2013 563679
4 Feb-2013 555394
5 Mar-2013 631974
6 Apr-2013 957395
7 May-2013 1104047
8 Jun-2013 693464
Note the lower-case y for a 2-digit year and the upper-case Y for a 4-digit year.
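Once both Date columns are in the same format, the join-and-sum described in the question could look roughly like this (df1, df2, col_date, and val are assumed names based on the samples above):
# normalize the abbreviated years in the first csv, then merge on the date column and sum the values
df1['col_date'] = pd.to_datetime(df1['col_date'], format='%b-%y').dt.strftime('%b-%Y')
merged = df1.merge(df2, on='col_date', suffixes=('_1', '_2'))
merged['val'] = merged['val_1'] + merged['val_2']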
