Organizing column and header data with pandas, python

I'm having a go at using NumPy instead of MATLAB, but I'm relatively new to Python.
My current challenge is importing the data from multiple files in a sensible way so that I can use and plot it. The data is organized in columns (Temperature, Pressure, Time, etc., with each file being one measurement period), and I decided pandas was probably the best way to import it. I was thinking of using a top-level descriptor for each file and sub-descriptors for each column. I thought of doing it something like this:
Reading Multiple CSV Files into Python Pandas Dataframe
The problem is that I'd like to retain and use some of the data in the header (for plotting, for instance). There are no column titles, just general info on the data measurements, something like this:
Flight ID: XXXXXX
Date: 01-27-10 Time: 5:25:19
OWNER
Release Point: xx.304N xx.060E 11 m
Serial Number xxxxxx
Surface Data: 985.1 mb 1.0 C 100% 1.0 m/s # 308 deg.
I really don't know how to extract and store the data in a way that makes sense when combined with the dataframe. I thought of perhaps a dictionary, but I'm not sure how to split the data efficiently since there's no consistent divider. Any ideas?

Looks like somebody is working with radiosondes...
When I pull in my radiosonde data I usually put it in a multi-level indexed dataframe. The levels could be of various forms and orders, but something like FLIGHT_NUM, DATE, ALTITUDE, etc. would make sense. Also, when working with sonde data I too want some additional information that does not necessarily need to be stored within the dataframe, so I store that as additional attributes. If I were to parse your file and then store it I would do something along the lines of this (yes, there are modifications that can be made to "improve" this):
import pandas as pd

with open("filename.csv", 'r') as f:
    # grab the header lines first; change 6 to match the number of your header rows
    header = f.read().split('\n')[:6]
    f.seek(0)  # rewind so read_csv sees the file from the start
    data = pd.read_csv(f, skiprows=6, skipinitialspace=True,
                       na_values=[-999, 'Infinity', '-Infinity'])
# now you can parse your header to get out the necessary information;
# continue until you have all the header info you want/need, e.g.
flight = header[0].split(': ')[1]
date = header[1].split(': ')[1].split(' ')[0]
time = header[1].split(': ')[2]
# a lot of the header information will get stored as metadata for me.
# most likely you want more than flight number and date in your
# metadata, but you get the point.
data.metadata = {'flight': flight,
                 'date': date}
I presume you have a date/time column (call it "dates" here) within your file, so you can use that to re-index your dataframe. If you choose to use different variables within your multi-level index then the same method applies.
new_index = [(data.metadata['flight'],r) for r in data.dates]
data.index = pd.MultiIndex.from_tuples(new_index)
You now have a multi-level indexed dataframe.
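For example (a quick sketch, reusing the hypothetical flight id 'XXXXXX' parsed from the header above), selecting by the outer level of the new index looks like:
one_flight = data.loc['XXXXXX']           # every row from that flight
same_thing = data.xs('XXXXXX', level=0)   # equivalent cross-section on level 0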
Now, regarding your "metadata". EdChum makes an excellent point: if you copy "data" you will NOT copy over the metadata dictionary. Also, if you save "data" to a file via data.to_pickle you will lose your metadata (more on this later). If you want to keep your metadata you have a couple of options.
Save the data on a flight-by-flight basis. This will allow you to store metadata for each individual flight's file.
Assuming you want to have multiple flights within one saved file: you can add additional columns within your dataframe that hold that information (i.e. another column for flight number, another column for surface temperature, etc.), though this will increase the size of your saved file; see the sketch after this list.
Assuming you want to have multiple flights within one saved file (option 2): You can make your metadata dictionary "keyed" by flight number. e.g.
data.metadata = {FLIGHT1: {'date': date},
                 FLIGHT2: {'date': date}}
Now to store the metadata. Check out my IO class on storing additional attributes within an h5 file posted here.
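If you don't want to pull in the whole class, the core trick (a sketch of the common pattern, not the linked code itself; requires PyTables) is to stash the dictionary on the HDF5 node's attributes:
store = pd.HDFStore('sondes.h5')
store.put('data', data)
store.get_storer('data').attrs.metadata = data.metadata  # rides along inside the h5 file
store.close()
Reading it back later is store.get_storer('data').attrs.metadata.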
Your question was quite broad, so you got a broad answer. I hope this was helpful.

Related

Importing 'prefixed-line' csv into pandas

I have a csv formatted like below which I'd like to import into pandas. Basically, it is a bunch of measurements and data from multiple samples.
dataformat1,sample1,sample_source,start_time
dataformat2,parameter1,spec_a_1,spec_b_1,measurement1
dataformat2,parameter2,spec_a_2,spec_b_2,measurement2
dataformat2,parameter3,spec_a_3,spec_b_3,measurement3
dataformat3,result_code_for_sample1,completion_time_for_sample1
...
dataformat1,sampleN,sample_source,start_time
dataformat2,parameter1,spec_a_1,spec_b_1,measurement1
dataformat2,parameter2,spec_a_2,spec_b_2,measurement2
dataformat2,parameter3,spec_a_3,spec_b_3,measurement3
dataformat3,result_code_for_sampleN,completion_time_for_sampleN
Essentially, the first field of the csv describes the data format of the rest of that line. I would like to import all of these into a single pandas dataframe, with each value filled into the relevant section. I am currently planning to do this by:
prepending the line number into the csv
reading each dataformat# into a separate dataframe
combining, sorting, and ffill/bfill/unpivot(?) shenanigans to fill in all the data
I assume this will work (a rough sketch of the plan is below), but I am wondering if there's a cleaner way to do this, either within pandas or using some other library. It is a somewhat common data-logging paradigm in the work I do.
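A minimal sketch of that plan (the file name samples.csv is an assumption, and 5 is the field count of the widest format, dataformat2):
import pandas as pd

# read ragged rows; shorter lines are padded with NaN
raw = pd.read_csv('samples.csv', header=None, names=range(5))
raw['line'] = raw.index   # 'prepending the line number': the RangeIndex already is one

# one frame per format, split on the first field
fmt1 = raw[raw[0] == 'dataformat1']
fmt2 = raw[raw[0] == 'dataformat2']
fmt3 = raw[raw[0] == 'dataformat3']

# the ffill step: propagate each sample's name down onto its dataformat2/3 rows
raw['sample'] = raw[1].where(raw[0] == 'dataformat1').ffill()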

Applying python dictionary to column of a CSV file

I have a CSV file that includes one column of data that is not user friendly. I need to translate that data into something that makes sense. Simple find/replace seems bulky, since there are dozens if not hundreds of different possible combinations I want to translate.
For instance: BLK = Black or MNT TP = Mountain Top
There are dozens if not hundreds of possible translations - I already have lots of them in a CSV table. The problem is how to use that dictionary to change the values in another CSV table. It is also important to note that this will (eventually) need to run on its own every few minutes - not just as a one-time translation.
It would be nice if you could describe in more detail what data you're working with. I'll take my best guess, though.
Let's say you have a CSV file, you use pandas to read it into a data frame named df, and the "not user friendly" column named col.
To replace all the value in column col, first, you need a dictionary containing all the keys (original texts) and values (new texts):
my_dict = {"BLK": "Black", "MNT TP": "Mountain Top", ...}
Then, map the dictionary to the column:
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
If a value appears as a key in the dictionary, it is replaced by the corresponding new value; otherwise, it keeps its original value.
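Since you already have the translations in a CSV table, you can build the dictionary from that table directly (a sketch; the file and column names here are assumptions):
import pandas as pd

trans = pd.read_csv('translations.csv')            # assumed columns: 'code', 'label'
my_dict = dict(zip(trans['code'], trans['label']))

df = pd.read_csv('data.csv')
df['col'] = df['col'].map(lambda x: my_dict.get(x, x))
df.to_csv('data_translated.csv', index=False)
Wrapped in a function, this can be run by a scheduler (cron, Task Scheduler) every few minutes.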

Is there a standard way of fixing missing values in pandas index column?

The problem
I'm working with a data set, which has been given to me as a csv file with lines of the form id,data. I would like to work with this data in a pandas dataframe, with the id as the index.
Unfortunately, somewhere along the data pipeline, my csv file has acquired a number of rows where the ids are missing. Fortunately, the rows of my data are not fully independent, so I can recreate the missing values: each row is linked to its predecessor, and I have access to an oracle which, when given an id, can give me all its data, including the id of its predecessor.
My question is therefore whether there's a simple way of filling in these missing values in my dataframe.
My Solution
I don't have much experience working with pandas, but after playing around for a bit I came up with the following approach. I start by reading the csv file into a dataframe without setting the index, so I end up with a RangeIndex. I then
Find the location of the rows with missing ids
Add 1 to the index to get the children of each row
Ask the oracle for the parents of each child
Merge the parents and children on the child id
Subtract one from the index again, and set the parent ids
In code:
children = df.loc[df[df['id'].isna()].index + 1, 'id']
parents = pd.Series({x.id: x.parent_id for x in ask_oracle(children)},
                    name='parent_id')
combined = pd.merge(children, parents, left_on='id', right_index=True)
combined.set_index(combined.index - 1, inplace=True)
df.loc[combined.index, 'id'] = combined['parent_id']
This works, but I'm 95% sure it's going to look like scary black magic in a few months time.
In particular, I'm unhappy with
The way I get the location of the nan rows. Three lots of df[ in one line is just too many
The manual fiddling about with the indices I have to do to get the rows to match up.
Does anyone have any suggestions for a better way of doing things?
The format of the input data is fixed, as are the properties of the oracle, but if there's a smarter way of organising my dataframe, I'm more than happy to hear it.
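For what it's worth, one tidier phrasing of the same steps (a sketch only, under the same ask_oracle assumption as above) replaces the nested df[ lookups with a boolean mask and a shift:
missing = df['id'].isna()                # rows whose id is gone
children = df['id'].shift(-1)[missing]   # each missing row's successor id
parent_of = {x.id: x.parent_id for x in ask_oracle(children)}
df.loc[missing, 'id'] = children.map(parent_of).values
shift(-1) lines each row up with its child, so no manual index arithmetic is needed.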

Using python pandas how can we select very specific rows and associated column

I am still learning python, so kindly excuse me if the question looks trivial to some.
I have a csv file with the following format, and I want to extract a small segment of it and write it to another csv file:
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column and write them to a csv file in the following format.
Since the format is not regular column headers followed by values, I am not sure how to select the starting point based on a cell value in a particular column. E.g. even if we consider actor_list2, it may have any number of entries under it. Please help me understand whether this can be done using pandas' dataframe-processing capability.
Update: The reason I would like to automate this is that there can be thousands of such files, and it would be impractical to manually pull that info out to create the final csv file, which will essentially have one row per file.
As Nour-Allah has pointed out, the formatting here is not very regular, to say the least. If your data comes out like this every time, the best you can do is skip some rows of the file:
import pandas as pd
df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result, but given how erratic the formatting is, there is no way to automate this. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
Since the CSV file you have is not regular, it contains a lot of empty positions, which pandas reads as NaN objects. Meanwhile, the columns will be indexed by number. I will use pandas to read it:
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then initialize an empty dictionary to store the results in; it will be used to build an output DataFrame, whose contents are finally written to a CSV file:
target = {}
Now find actor_list2 in the second column (the column with index 1) and, if it exists, store the names and scores from the following rows (columns 1 and 2) in the dictionary target:
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
    i = rows_index[0]
    while i + 1 < len(df):   # don't run past the end of the frame
        i += 1
        name = df.iloc[i, 1]
        score = df.iloc[i, 2]
        if pd.isna(name):    # a NaN means the names sequence is finished
            break
        target[name] = [score]
and finally, construct the DataFrame and write the new output.csv file:
df_output = pd.DataFrame(target)
df_output.to_csv('output.csv')
You can adapt the example above from here.
Good Luck

Is there a way to parse each unique value of a column into individual CSVs?

EDIT: Creating files is working, removing columns is not
EDIT2: ALL WORKING! Need help with combining two columns into one key. Is it possible to take two columns, state and county, and combine them into a state-county key?
I have a COVID-19 data set that I am trying to create tables with. Currently, I have one large dump file from the government github page.
Basically, I am attempting to take every unique value of the State column and create a new csv with the respective columns, only for that state.
So if Arizona has 4 data entries, it would create a new CSV with those four entries.
The sample data set I am retrieving from can be found here.
As we can see, the columns contain identifiers, state names, dates, etc.
I am looking to take each individual state and create a new csv with all the values for that state, including state, county, and the dates from 3/23-3/29.
This is a sample of what the data would look like after it is parsed:
What I believe needs to happen
What I have been working on is parsing out the unique values for the state column, which i did simply through
data = pd.read_csv('deaths.csv')
print(data['Province_State'].unique())
Now, I am trying to figure out how to select specific columns and write out the values for each unique state (including all counties for that same state).
Any help would be greatly appreciated!
EDIT:
Here's what I've tried:
def createCSV():
    data = pd.read_csv('deaths.csv', delimiter=',')
    data.drop([0,1,2,3,4,5,6,7,8,9,10])
    data = data.set_index('Province_State')
    data = data.rename(columns=pd.to_datetime)
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('3/23/2020', '3/29/20')] \
            .to_csv('{0}.csv'.format(name))
However, with this I get "unknown string format" for the columns that don't contain dates. I attempted to drop those columns by index, which didn't seem to do anything.
Manually deleting the columns gives the function I am looking for, but I need to delete the columns with pandas to save time.
For saving by state:
for name, g in data.groupby('Province_State'):
    g.to_csv('{0}.csv'.format(name))
For saving by state while only using certain dates:
data = data.set_index('Province_State')
data = data.rename(columns=pd.to_datetime)
for name, g in data.groupby(level='Province_State'):
    g[pd.date_range('3/23/2020', '3/29/20')] \
        .to_csv('{0}.csv'.format(name))
This assumes that the only columns are the region name and the dates. If this isn't the case, remove the non-date columns prior to converting them to datetimes.
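For the dump linked in the question, a sketch of that clean-up (the identifier columns listed here match the usual layout of that file; adjust the list to yours). It also gives EDIT2 its state-county key, assuming Admin2 holds the county name:
id_cols = ['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2',
           'Country_Region', 'Lat', 'Long_', 'Combined_Key', 'Population']

data = pd.read_csv('deaths.csv')
data['Province_State'] = data['Province_State'] + '-' + data['Admin2']  # EDIT2: state-county key
data = data.drop(columns=[c for c in id_cols if c in data.columns])
After this only Province_State and the date columns remain, so the set_index/rename/groupby snippet above works unchanged.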
