Importing 'prefixed-line' csv into pandas - python

I have a csv formatted like below which I'd like to import into pandas. Basically, it is a bunch of measurements and data from multiple samples.
dataformat1,sample1,sample_source,start_time
dataformat2,parameter1,spec_a_1,spec_b_1,measurement1
dataformat2,parameter2,spec_a_2,spec_b_2,measurement2
dataformat2,parameter3,spec_a_3,spec_b_3,measurement3
dataformat3,result_code_for_sample1,completion_time_for_sample1
...
dataformat1,sampleN,sample_source,start_time
dataformat2,parameter1,spec_a_1,spec_b_1,measurement1
dataformat2,parameter2,spec_a_2,spec_b_2,measurement2
dataformat2,parameter3,spec_a_3,spec_b_3,measurement3
dataformat3,result_code_for_sampleN,completion_time_for_sampleN
Essentially, the first field of the csv describes the data format of the rest of that line. I would like to import all of these into a single pandas dataframe, with each value filled into the relevant section. I am currently planning to do this by:
prepending the line number into the csv
reading each dataformat# into a separate dataframe
combining, sorting, and ffill/bfill/unpivot(?) shenanigans to fill in all the data
I assume this will work (a sketch of the plan is below), but I am wondering if there's a cleaner way to do this, either within pandas or using some other library. It is a fairly common data-logging paradigm in the work I do.
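For concreteness, here is a minimal sketch of that plan; the file name, column count, and column labels are made up, and merge_asof takes care of the "sort and ffill" step in one call:
import pandas as pd

# Read every line with a generic header; short rows are padded with NaN.
raw = pd.read_csv("samples.csv", header=None, names=list(range(5)))

# Step 1: keep the original line number so rows stay ordered.
raw = raw.reset_index().rename(columns={"index": "line"})

# Step 2: split by the format tag in the first field.
fmt1 = raw[raw[0] == "dataformat1"]  # sample headers
fmt2 = raw[raw[0] == "dataformat2"]  # measurements
fmt3 = raw[raw[0] == "dataformat3"]  # results

# Step 3: attach each measurement row to the most recent sample header
# by matching backwards on line number.
measurements = pd.merge_asof(
    fmt2, fmt1[["line", 1]].rename(columns={1: "sample"}),
    on="line", direction="backward",
)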

Related

Values of the columns are null and swapped in pyspark dataframe

I am using pyspark==2.3.1. I performed data preprocessing using pandas, and now I want to convert my preprocessing function from pandas to pyspark. But while reading the CSV file with pyspark, a lot of values in a column that actually has values become null. If I try to perform any operation on this dataframe, it swaps the values of that column with other columns. I have also tried different versions of pyspark. Please let me know what I am doing wrong. Thanks
Result from pyspark (screenshot omitted): the column "property_type" shows null where the actual dataframe has values.
CSV file (screenshot omitted).
Pyspark does work fine with small datasets, though.
In our case we faced a similar issue. Things you need to check:
Check whether your data contains " (double quotes); pyspark can misinterpret these and the data will look malformed.
Check whether your csv data is multiline.
We handled this situation by setting the following configuration:
spark.read.options(header=True, inferSchema=True, escape='"').option("multiline", 'true').csv(schema_file_location)
Are you limited to the CSV file format?
Try parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works with this format really well.
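For example (a sketch; it assumes an existing SparkSession named spark and a parquet engine such as pyarrow installed):
import pandas as pd

pdf = pd.DataFrame({"property_type": ["house", None, "apartment"],
                    "price": [1, 2, 3]})
pdf.to_parquet("data.parquet")  # instead of pdf.to_csv(...)

# Parquet stores the schema with the data, so there is no quoting,
# multiline, or type-inference ambiguity on the Spark side.
sdf = spark.read.parquet("data.parquet")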

Auxiliary data / description in pandas DataFrame

Imagine I have a pd.DataFrame containing a lot of rows, and I want to add some string data describing/summarizing it.
One way to do this would be to add a column with this value, but that would be a complete waste of memory (e.g. when the string is a long specification of an ML model).
I could also insert it into the file name, but this has limitations and is very impractical for saving/loading.
I could store this data in a separate file, but that is not what I want.
I could make a class based on pd.DataFrame and add this field, but then I am unsure whether pickle save/load would still work properly.
So is there some really clean way to store something like a "description" in a pandas DataFrame? Preferably one that would survive to_pickle/read_pickle operations.
[EDIT]: Of course, this raises further questions, such as what happens when we concatenate dataframes carrying such information. But let's stick to the saving/loading problem only.
Googled it a bit, you can name a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.ones([10, 10]))
df.name = 'My name'
print(df.name)
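Note that ad hoc attributes like df.name are not preserved by most operations. A related option in pandas 1.0+ is the (still experimental) DataFrame.attrs dictionary, which in recent pandas versions survives a pickle round-trip:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.attrs["description"] = "long ML model specification ..."

df.to_pickle("df.pkl")
restored = pd.read_pickle("df.pkl")
print(restored.attrs["description"])  # survives the round-trip in recent pandas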

Converting weird NBA data structure into a pandas dataframe from an API

I am trying to convert this string into a pandas dataframe, where each value before a colon is a header and the value after it goes into that column. This is what the data structure looks like:
{"data":[{"id":47179,"date":"2019-01-30T00:00:00.000Z","home_team":{"id":2,"abbreviation":"BOS","city":"Boston","conference":"East","division":"Atlantic","full_name":"Boston Celtics","name":"Celtics"},"home_team_score":126,"period":4,"postseason":false,"season":2018,"status":"Final","time":" ","visitor_team":{"id":4,"abbreviation":"CHA","city":"Charlotte","conference":"East","division":"Southeast","full_name":"Charlotte Hornets","name":"Hornets"},"visitor_team_score":94},{"id":48751,"date":"2019-02-09T00:00:00.000Z","home_team":{"id":2,"abbreviation":"BOS","city":"Boston","conference":"East","division":"Atlantic","full_name":"Boston Celtics","name":"Celtics"}}
This string I am converting has hundreds of games, but I stopped pasting after two. What is this data structure called, and how can I quickly move it into a dataframe?
It looks like JSON, but it's not structured in a very useful way; everything is under the "data" key. To get it into a DataFrame, use a combination of Python's json module and pandas.json_normalize().
import json
import pandas as pd
data_string = """{"data":[{"id":47179,"date":"2019-01-30T00:00:00.000Z",
"home_team":{"id":2,"abbreviation":"BOS","city":"Boston",
"conference":"East","division":"Atlantic","full_name":"Boston Celtics",
"name":"Celtics"},"home_team_score":126,"period":4,"postseason":false,
"season":2018,"status":"Final","time":" ","visitor_team":{"id":4,
"abbreviation":"CHA","city":"Charlotte","conference":"East",
"division":"Southeast","full_name":"Charlotte Hornets","name":"Hornets"},
"visitor_team_score":94},{"id":48751,"date":"2019-02-09T00:00:00.000Z",
"home_team":{"id":2,"abbreviation":"BOS","city":"Boston",
"conference":"East","division":"Atlantic","full_name":"Boston Celtics",
"name":"Celtics"}}]}"""
raw_data = json.loads(data_string)
df = pd.json_normalize(raw_data['data'])
There is a lot of redundant information about the teams: look at "home_team" for the two games; the information is the same. You could pull all the team data into a separate DataFrame and use the team "id" or "abbreviation" in the DataFrame with all the games, as sketched below.
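A possible sketch of that split, building on the df above (the dotted column names come from json_normalize's default separator):
# Collect the nested team records into one de-duplicated table.
home = df.filter(like="home_team.").rename(columns=lambda c: c.replace("home_team.", ""))
visitor = df.filter(like="visitor_team.").rename(columns=lambda c: c.replace("visitor_team.", ""))
teams = pd.concat([home, visitor]).dropna(subset=["id"]).drop_duplicates(subset="id")

# Keep only the foreign keys in the games table.
games = df.rename(columns={"home_team.id": "home_team_id",
                           "visitor_team.id": "visitor_team_id"})
keep = [c for c in games.columns
        if not c.startswith(("home_team.", "visitor_team."))]
games = games[keep]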

How to write dataframe to csv with a single-row header (5k columns)?

I am trying to export a pandas dataframe with to_csv so it can be processed by another tool before using it again with Python. It is a token dataset with 5k columns. When exported, the header is split across two rows. This might not be an issue for pandas, but in this case I need to export it as a csv with a single-row header. Is this a pandas limitation or a csv-format one?
So far, searching has returned no usable results. The only solution I came up with is writing the column names and the values separately, e.g. writing a str column list first and then a numpy array to the csv. Can this be implemented, and if so, how?
For me this problem was caused by the aggregation producing a multi-level column index. The easiest way to resolve the issue is to specify your own headers. I found references to an option called tupleize_cols, but it doesn't exist in current (1.2.2) pandas.
I was using the following aggregation:
df.groupby(["device"]).agg({
"outage_length":["count","sum"],
}).to_csv("example.csv")
This resulted in the following csv output:
,outage_length,outage_length
,count,sum
device,,
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
I specified my own headers in the call to to_csv, excluding my groupby column, as follows:
}).to_csv("example.csv",header=("flaps","downtime"))
And got the following csv output, which was much more pleasing to spreadsheet software:
device,flaps,downtime
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
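An alternative not covered in the answer above is to flatten the MultiIndex columns yourself before writing; this scales to any number of aggregated columns without hand-listing headers:
agg = df.groupby("device").agg({"outage_length": ["count", "sum"]})

# Join the two header levels into single names, e.g. "outage_length_count".
agg.columns = ["_".join(col) for col in agg.columns]
agg.to_csv("example.csv")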

Organizing column and header data with pandas, python

I'm having a go at using NumPy instead of Matlab, but I'm relatively new to Python.
My current challenge is importing the data from multiple files in a sensible way so that I can use and plot it. The data is organized in columns (Temperature, Pressure, Time, etc., each file being a measurement period), and I decided pandas was probably the best way to import the data. I was thinking of using a top-level descriptor for each file and sub-descriptors for each column. I thought of doing it something like this:
Reading Multiple CSV Files into Python Pandas Dataframe
The problem is I'd like to retain and use some of the data in the header (for plotting, for instance). There are no column titles, but there is general info on the data measurements, something like this:
Flight ID: XXXXXX
Date: 01-27-10 Time: 5:25:19
OWNER
Release Point: xx.304N xx.060E 11 m
Serial Number xxxxxx
Surface Data: 985.1 mb 1.0 C 100% 1.0 m/s # 308 deg.
I really don't know how to extract and store this data in a way that makes sense when combined with the dataframe. I thought of perhaps a dictionary, but I'm not sure how to split the data efficiently since there's no consistent divider. Any ideas?
Looks like somebody is working with radiosondes...
When I pull in my radiosonde data I usually put it in a multi-level indexed dataframe. The levels could be of various forms and orders, but something like FLIGHT_NUM, DATE, ALTITUDE, etc. would make sense. Also, when working with sonde data I too want some additional information that does not necessarily need to be stored within the dataframe, so I store that as additional attributes. If I were to parse your file and then store it I would do something along the lines of this (yes, there are modifications that can be made to "improve" this):
import pandas as pd

with open("filename.csv", 'r') as f:
    # Grab the header block line by line; change 6 to match your number of header rows.
    header = [f.readline().strip() for _ in range(6)]
    # read_csv picks up from the current file position, right after the header.
    data = pd.read_csv(f, skipinitialspace=True,
                       na_values=[-999, 'Infinity', '-Infinity'])

# Now you can parse your header to get out the necessary information;
# continue until you have all the header info you want/need, e.g.
flight = header[0].split(': ')[1]
date = header[1].split(': ')[1].split(' ')[0]
time = header[1].split(': ')[2]

# A lot of the header information will get stored as metadata for me.
# Most likely you want more than flight number and date, but you get the point.
data.metadata = {'flight': flight,
                 'date': date}
I presume you have a date/time column (call it "dates" here) within your file, so you can use that to re-index your dataframe. If you choose to use different variables within your multi-level index then the same method applies.
new_index = [(data.metadata['flight'],r) for r in data.dates]
data.index = pd.MultiIndex.from_tuples(new_index)
You now have a multi-level indexed dataframe.
Now, regarding your "metadata". EdChum makes an excellent point that if you copy "data" you will NOT copy over the metadata dictionary. Also, if you save "data" to a file via data.to_pickle, you will lose your metadata (more on this later). If you want to keep your metadata, you have a couple of options.
Save the data on a flight-by-flight basis. This will allow you to store metadata for each individual flight's file.
Assuming you want to have multiple flights within one saved file: you can add additional columns within your dataframe that hold that information (i.e. another column for flight number, another column for surface temperature, etc.), though this will increase the size of your saved file.
Assuming you want to have multiple flights within one saved file (option 2): You can make your metadata dictionary "keyed" by flight number. e.g.
data.metadata = {FLIGHT1: {'date': date},
                 FLIGHT2: {'date': date}}
Now, to store the metadata: check out my IO class for storing additional attributes within an h5 file, posted here.
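For reference, a minimal sketch of that idea (not the linked class itself): HDFStore can keep the dataframe and a pickled metadata dict side by side via PyTables attributes (requires the tables package):
import pandas as pd

with pd.HDFStore("sonde.h5") as store:
    store.put("sonde", data)
    # Arbitrary Python objects can be attached to the node's attributes.
    store.get_storer("sonde").attrs.metadata = data.metadata

with pd.HDFStore("sonde.h5") as store:
    data = store["sonde"]
    data.metadata = store.get_storer("sonde").attrs.metadata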
Your question was quite broad, so you got a broad answer. I hope this was helpful.
