I'm attempting to read a flat file into a DataFrame using pandas, but I can't seem to get the format right. My file has a variable number of fields per line and looks like this:
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCinpt|MIME=application/synthesis+ssml|TXID=NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAAA-txt|TXSZ=1167|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCsynd|INPT=1167|DURS=5120|RSTT=stop|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOClise|LUSED=0|LMAX=100|OMAX=95|LFEAT=tts|UCPU=0|SCPU=0
The field separator is |. I've pulled a list of all unique keys into keylist, and am trying to use the following to read in the data:
keylist = ['TIME',
'CHAN',
# [truncated]
'DURS',
'RSTT']
test_fp = 'c:\\temp\\test_output.txt'
df = pd.read_csv(test_fp, sep='|', names=keylist)
This incorrectly builds the DataFrame as I'm not specifying any way to recognize the key label in the line. I'm a little stuck and am not sure which way to research -- should I be using .read_json() for example?
Not sure if there's a slick way to do this. Sometimes when the data structure is different enough from the norm it's easiest to preprocess it on the Python side. Sure, it's not as fast, but since you could immediately save it in a more standard format it's usually not worth worrying about.
One way:
with open("wfield.txt") as fp:
rows = (dict(entry.split("=",1) for entry in row.strip().split("|")) for row in fp)
df = pd.DataFrame.from_dict(rows)
which produces
>>> df
CHAN DURS EVNT INPT LFEAT LMAX LUSED \
0 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOCinpt NaN NaN NaN NaN
1 FCJNJKDCAAANPCKEAAAAAAAA 5120 NVOCsynd 1167 NaN NaN NaN
2 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOClise NaN tts 100 0
MIME OMAX RSTT SCPU TIME \
0 application/synthesis+ssml NaN NaN 15 20131203004552049
1 NaN NaN stop 15 20131203004552049
2 NaN 95 NaN 0 20131203004552049
TXID TXSZ UCPU
0 NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAA... 1167 31
1 NaN NaN 31
2 NaN NaN 0
[3 rows x 15 columns]
After you've got this, you can reshape as needed. (I'm not sure if you wanted to combine rows with the same TIME & CHAN or not.)
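For example, if you did want to collapse rows sharing the same TIME & CHAN, one way (assuming the remaining fields never conflict within a group) is groupby with first, which keeps the first non-null value per column:

# collapse rows that share TIME and CHAN, taking the first
# non-null value seen for every other column
merged = df.groupby(["TIME", "CHAN"], as_index=False).first()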
Edit: if you're using an older version of pandas which doesn't support passing a generator to from_dict, you can build it from a list instead:
df = pd.DataFrame(list(rows))
but note that you may have to convert columns from strings to numeric types after the fact.
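For example, something along these lines; the column list here is just illustrative:

# convert string columns that should be numeric; errors="coerce"
# turns anything unparseable into NaN
for col in ["TXSZ", "DURS", "INPT", "UCPU", "SCPU"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")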
I've read a csv into Pandas that has a varying number of values per row and some blank lines in between the rows.
Example:
This is an example

CustomerID; 123;
Test ID; 144;
Seen_on_Tv; yes;

now_some_measurements_1;
test1; 333; 444; 555;
test2; 344; 455; 566;
test3; 5544; 3424; 5456;

comment; this test sample is only for this Stackoverflow question, but
similar to my real data.
When reading in this file, I use this code:
pat = pd.read_csv(FileName, skip_blank_lines = False, header=None, sep=";", names=['a', 'b', 'c', 'd', 'e'])
pat.head(10)
output:
a b c d e
0 This is an example NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 CustomerID 123 NaN NaN NaN
3 Test ID 144 NaN NaN NaN
4 Seen_on_Tv yes NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 now_some_measurements_1 NaN NaN NaN NaN
7 test1 333 444.0 555.0 NaN
8 test2 344 455.0 566.0 NaN
9 test3 5544 3424.0 5456.0 NaN
This works, especially since I have to change the CustomerID etc., via code like this:
newID = 'HASHED'
pat.loc[pat['a'] == 'CustomerID', 'b']=newID
However, when I save this changed dataframe to csv, I get a lot of 'trailing' separators (";"), as most of the columns are empty, especially with the blank lines.
pat.to_csv('out.csv', sep=";", index = False, header=False)
output (out.csv):
This is an example;;;;
;;;;
CustomerID; HASHED;;;
Test ID; 144;;;
Seen_on_Tv; yes;;;
;;;;
now_some_measurements_1;;;;
test1; 333;444.0;555.0;
test2; 344;455.0;566.0;
test3; 5544;3424.0;5456.0;
;;;;
comment; this test sample is only for this Stackoverflow question, but similar to my real
data.
;;;
I've searched almost everywhere for a solution, but cannot find one.
How can I write only the non-blank column values to the csv file (except for the blank lines that separate the sections, which of course need to remain blank)?
Thank you in advance for your kind help.
A simple way would be to just parse your out.csv and, for the non-blank lines (those not consisting solely of ;'s), write a stripped version of that line, eg:
with open('out.csv') as fin, open('out2.csv', 'w') as fout:
    for line in fin:
        # ":=" needs Python 3.8+; strip separators/whitespace to test for content
        if stripped := line.strip(';\n '):
            fout.write(stripped + '\n')
        else:
            fout.write(line)
Will give you:
This is an example
;;;;
CustomerID; HASHED
Test ID; 144
Seen_on_Tv; yes
;;;;
now_some_measurements_1
test1; 333;444.0;555.0
test2; 344;455.0;566.0
test3; 5544;3424.0;5456.0
;;;;
comment; this test sample is only for this Stackoverflow question, but similar to my real
data.
;;;
You could also pass an io.StringIO object to to_csv (to save writing to disk and then re-reading) as the output destination, then parse that in a similar fashion to produce your desired output file.
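A rough sketch of that variant (out2.csv is just an example name):

import io

buf = io.StringIO()
pat.to_csv(buf, sep=';', index=False, header=False)
buf.seek(0)  # rewind before reading the buffer back

with open('out2.csv', 'w') as fout:
    for line in buf:
        if stripped := line.strip(';\n '):
            fout.write(stripped + '\n')
        else:
            fout.write(line)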
I am looking through a DataFrame with different kinds of data whose usefulness I'm trying to evaluate. So I am looking at each column and checking the kind of data it is. E.g.
print(extract_df['Auslagenersatz'])
For some I get responses like this:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
263 NaN
264 NaN
265 NaN
266 NaN
267 NaN
I would like to check whether that column contains any information at all, so what I am looking for is something like
s = extract_df['Auslagenersatz']
print(s.loc[s == True])
where I am assuming that NaN is interpreted as False in the same way an empty set is. I would like it to return only those elements of the series that satisfy this condition (being not empty). The code above does not work however, as I get an empty set even for columns that I know have non-NaN entries.
I oriented myself with this post How to select rows from a DataFrame based on column values
but I can't figure out where I'm going wrong or how to do this instead. The problem comes up often, so any help is well appreciated.
import pandas as pd
df = pd.DataFrame({'A':[2,3,None, 4,None], 'B':[2,13,None, None,None], 'C':[None,3,None, 4,None]})
If you want to see the rows where column A is non-NA, then:
df[~df['A'].isna()]
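And if the goal is just to check whether a column contains any information at all, as in the question, notna/dropna sidestep the s == True pitfall entirely (NaN == True evaluates to False element-wise, so that filter selects nothing):

s = df['A']
s.notna().any()  # True if at least one non-NaN value exists
s.dropna()       # only the non-NaN entries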
I have a large data frame of schedules, and I need to count the number of experiments run. The challenge is that usage for an experiment is repeated across rows (which is OK), but is duplicated in some, but not all, columns. I want to remove the second entry (if duplicated), but I can't delete the entire second column because it will contain some new values too. How can I compare individual entries for two columns in a side-by-side fashion and delete the second if there is a duplicate?
The duration for this is a maximum of two days, so three days in a row is a new event with the same name starting on the third day.
The actual text for the experiment names is complicated and the data frame is 120 columns wide, so typing this in as a list or dictionary isn't possible. I'm hoping for a python or numpy function, but could use a loop.
Here are pictures for an example of the starting data frame and the desired output: [starting data frame example] [de-duplicated data frame example]
This is a hack, similar to @Params' answer, but it would be faster because you aren't calling .iloc a lot. The basic idea is to transpose the data frame and repeat an operation as many times as needed to compare all of the columns. Then transpose it back to get the result in the OP.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Monday':   ['exp_A','exp_A','exp_A','exp_A','exp_B',np.nan,np.nan,np.nan,'exp_D','exp_D'],
    'Tuesday':  ['exp_A','exp_A','exp_A','exp_A','exp_B','exp_B','exp_C','exp_C','exp_D','exp_D'],
    'Wednesday':['exp_A','exp_D',np.nan,np.nan,np.nan,'exp_B','exp_C','exp_C','exp_C',np.nan],
    'Thursday': ['exp_A','exp_D',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'exp_C',np.nan]
})
df = df.T
for i in range(int(np.ceil(df.shape[0]/2))):
    # NaN-out a value that matches the previous row (day)
    # but not the row before that (a third day starts a new event)
    df[(df == df.shift(1)) & (df != df.shift(2))] = np.nan
df = df.T
Monday Tuesday Wednesday Thursday
0 exp_A NaN exp_A NaN
1 exp_A NaN exp_D NaN
2 exp_A NaN NaN NaN
3 exp_A NaN NaN NaN
4 exp_B NaN NaN NaN
5 NaN exp_B NaN NaN
6 NaN exp_C NaN NaN
7 NaN exp_C NaN NaN
8 exp_D NaN exp_C NaN
9 exp_D NaN NaN NaN
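If the end goal is counting how many experiment runs there were, one possible follow-up on the de-duplicated frame (each remaining non-null cell now marks the start of a run):

# stack() drops the NaNs, leaving one entry per event start
counts = df.stack().value_counts()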
I have a dataframe called 'Adj_Close' which looks like this:
AAPL TSLA GOOG
0 3.478462 NaN NaN
1 3.185191 NaN NaN
2 3.231803 NaN NaN
3 2.952128 NaN NaN
4 3.091966 NaN NaN
... ... ... ...
5005 261.779999 333.040009 1295.339966
5006 266.369995 336.339996 1306.689941
5007 264.290009 328.920013 1313.550049
5008 267.839996 331.290009 1312.989990
5009 267.250000 329.940002 1304.959961
I want to save each column ('AAPL', 'TSLA' & 'GOOG') in a new dataframe.
The code should look like this:
i = 0
n = 3
while i < n:
df_{i} = Adj_Close.iloc[:,i]
i += 1
Unfortunately it is the wrong syntax. I hope someone can help me...
The natural way to do that in Python would be to create a list of dataframes, as in:
dataframes = []
for col in df.columns:  # df here is your Adj_Close frame
    new_df = pd.DataFrame(df[col])
    dataframes.append(new_df)
The result is a list (dataframes) that contains three separate data frames: one for Apple, one for Tesla, and one for Google.
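You can then grab, say, the TSLA frame with dataframes[1]. If lookup by name is more convenient, a small variation is a dict keyed by column name:

frames = {col: df[[col]] for col in df.columns}
frames['TSLA']  # the single-column TSLA DataFrame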
[ One can also define new variables using
globals()[my_var_name] = <some_value>
But I don't believe that's what you're looking for.
]
I'm currently trying to create a new csv based on an existing csv.
I can't find a faster way to set values of a dataframe based on an existing dataframe's values.
import pandas
import sys
import numpy
import time
# path to file as argument
path = sys.argv[1]
df = pandas.read_csv(path, sep = "\t")
# only care about lines with response_time
df = df[pandas.notnull(df['response_time'])]
# new empty dataframe
new_df = pandas.DataFrame(index = df["datetime"])
# new_df needs to have datetime as index
# and columns based on a combination
# of 2 columns name from previous dataframe
# (there are only 10 differents combinations)
# and response_time as values, so there will be lots of
# blank cells but I don't care
for i, row in df.iterrows():
    start = time.time()
    new_df.set_value(row["datetime"], row["name"] + "-" + row["type"], row["response_time"])
    print(i, time.time() - start)
Original dataframe is:
datetime name type response_time
0 2018-12-18T00:00:00.500829 HSS_ANDROID audio 0.02430
1 2018-12-18T00:00:00.509108 HSS_ANDROID video 0.02537
2 2018-12-18T00:00:01.816758 HSS_TEST audio 0.03958
3 2018-12-18T00:00:01.819865 HSS_TEST video 0.03596
4 2018-12-18T00:00:01.825054 HSS_ANDROID_2 audio 0.02590
5 2018-12-18T00:00:01.842974 HSS_ANDROID_2 video 0.03643
6 2018-12-18T00:00:02.492477 HSS_ANDROID audio 0.01575
7 2018-12-18T00:00:02.509231 HSS_ANDROID video 0.02870
8 2018-12-18T00:00:03.788196 HSS_TEST audio 0.01666
9 2018-12-18T00:00:03.807682 HSS_TEST video 0.02975
new_df will look like this: [image of the pivoted dataframe]
It takes 7 ms per loop iteration, which makes it an eternity to process a (only?) 400,000-row DataFrame. How can I make it faster?
Indeed, using pivot will do what you're looking for, such as:
import pandas as pd
new_df = pd.pivot(df.datetime, df.name + '-' + df.type, df.response_time)
print (new_df.head())
HSS_ANDROID-audio HSS_ANDROID-video \
datetime
2018-12-18T00:00:00.500829 0.0243 NaN
2018-12-18T00:00:00.509108 NaN 0.02537
2018-12-18T00:00:01.816758 NaN NaN
2018-12-18T00:00:01.819865 NaN NaN
2018-12-18T00:00:01.825054 NaN NaN
HSS_ANDROID_2-audio HSS_ANDROID_2-video \
datetime
2018-12-18T00:00:00.500829 NaN NaN
2018-12-18T00:00:00.509108 NaN NaN
2018-12-18T00:00:01.816758 NaN NaN
2018-12-18T00:00:01.819865 NaN NaN
2018-12-18T00:00:01.825054 0.0259 NaN
HSS_TEST-audio HSS_TEST-video
datetime
2018-12-18T00:00:00.500829 NaN NaN
2018-12-18T00:00:00.509108 NaN NaN
2018-12-18T00:00:01.816758 0.03958 NaN
2018-12-18T00:00:01.819865 NaN 0.03596
2018-12-18T00:00:01.825054 NaN NaN
And to avoid having NaN, you can use fillna with any value you want, such as:
new_df = pd.pivot(df.datetime, df.name +'-'+df.type, df.response_time).fillna(0)
You can also use unstack as another option:
new = df.set_index(['type','name', 'datetime']).unstack([0,1])
new.columns = ['{}-{}'.format(z, y) for x, y, z in new.columns]
Using f-strings will be a little faster than format:
new.columns = [f'{z}-{y}' for x, y, z in new.columns]