Slicing DAT file by Fixed Width Stored in Dict - python

I am having some trouble (I've been trying this for a long time) and still couldn't work out a solution on my own. I have a DAT file in this format:
abc900800007.2
And I have a dict whose keys are the column names and whose values are the corresponding fixed widths for the DAT file; my dict goes like mydict = {'col1': 3, 'col2': 8, 'col3': 3}.
What I want to do is to create a df by combining both, i.e. slicing the DAT file by the dict values. The df should look like:
col1 col2 col3
abc 90080000 7.2
Any help would be highly appreciated!

I think a possible (though, depending on the file size, memory-intensive) solution is:
import pandas as pd

mydict = {'col1': 3, 'col2': 8, 'col3': 3}

data = {'col1': [], 'col2': [], 'col3': []}
for line in open('file.dat'):
    # slice each line according to the widths stored in mydict
    data['col1'].append(line[:mydict['col1']])
    begin = mydict['col1']
    end = begin + mydict['col2']
    data['col2'].append(line[begin:end])
    begin = end
    end = begin + mydict['col3']
    data['col3'].append(line[begin:end])

df = pd.DataFrame(data)  # create the DataFrame
del data  # delete the auxiliary data
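Pandas also ships a fixed-width reader, so a shorter route is possible. A minimal sketch, assuming the widths in mydict are listed in column order (dicts preserve insertion order in Python 3.7+):

import pandas as pd

mydict = {'col1': 3, 'col2': 8, 'col3': 3}

# read_fwf slices every line by the given widths; header=None because
# the DAT file has no header row
df = pd.read_fwf('file.dat',
                 widths=list(mydict.values()),
                 names=list(mydict.keys()),
                 header=None)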

Related

Compare data between two csv files and count how many rows have the same data

Let's say I have a list of all OUs (AllOU.csv):
NEWS
STORE
SPRINKLES
ICECREAM
I want to look at the third column, called 'column3', of a csv file (samplefile.csv), and check each row for whether it matches one of the OUs in AllOU.csv.
Then I want to sort them and count how many rows each one has.
This is how the column looks:
column3
CN=Clark Kent,OU=news,dc=company,dc=com
CN=Mary Poppins,OU=ice cream, dc=company,dc=com
CN=Mary Jane,OU=news,OU=tv,dc=company,dc=com
CN=Pepper Jack,OU=store,OU=tv,dc=company,dc=com
CN=Monty Python,OU=store,dc=company,dc=com
CN=Anne Potts,OU=sprinkles,dc=company,dc=com
I want to sort them out like this (or a list):
CN=Clark Kent,OU=news,dc=company,dc=com
CN=Mary Jane,OU=news,OU=tv,dc=company,dc=com
CN=Pepper Jack,OU=tv,OU=store,dc=company,dc=com
CN=Monty Python,OU=store,dc=company,dc=com
CN=Mary Poppins,OU=ice cream, dc=company,dc=com
CN=Anne Potts,OU=sprinkles,dc=company,dc=com
This is what the final output should be:
2, news
2, store
1, icecream
1, sprinkles
Maybe a list would be a good way of sorting them? Like this?
holdingList = [
    ['CN=Clark Kent,OU=news,dc=company,dc=com', 'CN=Mary Jane,OU=news,OU=tv,dc=company,dc=com'],
    ['CN=Pepper Jack,OU=tv,OU=store,dc=company,dc=com', 'CN=Monty Python,OU=store,dc=company,dc=com'],
    ['CN=Mary Poppins,OU=ice cream, dc=company,dc=com'],
    ['CN=Anne Potts,OU=sprinkles,dc=company,dc=com'],
]
I had something like this so far:
import pandas as pd

df = pd.read_csv('samplefile.csv', usecols=['column3'])

# file of all OUs
OUList = pd.read_csv('ALLOU.csv', header=None)

for OU in OUList[0]:
    df_dept = df[df['column3'].str.contains(f'OU={OU}')].count()
    print(OU, df_dept)
Read your file first and create a list of objects.
[{'CN': 'Clark Kent', 'OU': 'news', 'dc': 'company', 'dc': 'com'}, ... {...}]
Once you have created the list, you can convert it to a data frame and then apply all the grouping, sorting and other abilities of pandas.
To achieve this, first read your file's contents into a variable, say filedata, then split it into lines: lines = filedata.split('\n').
Now loop over each line:
dataList = []
for line in lines:
    item = dict()
    elements = line.split(',')
    for element in elements:
        key_value = element.split('=')
        # note: a repeated key such as a second OU= will overwrite the first
        item[key_value[0]] = key_value[1]
    dataList.append(item)
print(dataList)
Now you may load this into a pandas dataframe and apply the sorting and grouping. Once you have structured the data frame, you can simply search for each key from the other file in this dataframe and get your numbers.
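As a rough sketch of that grouping step (my own illustration, not from the answer; it assumes the names in AllOU.csv match the OU= values once spaces are dropped and case is ignored, e.g. 'ICECREAM' vs 'ice cream'):

import pandas as pd

df = pd.read_csv('samplefile.csv', usecols=['column3'])

# pull every OU=... value out of each DN, one row per occurrence
ous = df['column3'].str.findall(r'OU=([^,]+)').explode()

# normalize so 'ice cream' counts against 'ICECREAM'
counts = ous.str.replace(' ', '', regex=False).str.upper().value_counts()
print(counts)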

Adding a pandas.dataframe to another one with its own name

I have data that I want to retrieve from a couple of text files in a folder. For each file in the folder, I create a pandas.DataFrame to store the data. For now it works correctly and all the files have the same number of rows.
Now what I want to do is add each of these dataframes to a 'master' dataframe containing all of them, with each one labelled by its file name.
I already have the file name.
For example, let's say I have 2 dataframes with their own file names; I want to add them to the master dataframe with a header for each of these 2 dataframes representing the name of the file.
What I have tried now is the following:
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame()
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_name = getFileName(file)
    t0_data.insert(loc=len(t0_data.columns), column=file_name, value=file_data)
Could someone help me with this please?
Thank you :)
Edit:
I think I was not clear enough; this is what I am expecting as an output:
[image: expected output]
You may be looking for the concat function. Here's an example:
import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})
a_filename = 'a_filename.txt'
b_filename = 'b_filename.txt'
A['filename'] = a_filename
B['filename'] = b_filename
C = pd.concat((A, B), ignore_index=True)
print(C)
Output:
Col1 Col2 filename
0 1 4 a_filename.txt
1 2 5 a_filename.txt
2 3 6 a_filename.txt
3 7 10 b_filename.txt
4 8 11 b_filename.txt
5 9 12 b_filename.txt
There are a couple of changes to make here in order to make this happen in an easy way. I'll list the changes and the reasoning below:
Specify which columns your master DataFrame will have.
Instead of using some function that it seems like you were trying to define, you can simply create a new column called "file_name" holding the filepath used to make the DataFrame, for every record in that DataFrame. That way, when you combine the DataFrames, each record's origin is clear. I commented where you can make edits if you want to use string methods to clean up the filenames.
At the end, don't use insert. For combining DataFrames with the same columns (a union operation, if you're familiar with SQL or set theory), you can use the append method (removed in pandas 2.0; see the concat version after the code).
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame(columns=['wavelength', 'max', 'min', 'file_name'])
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    t0_data = t0_data.append(file_data, ignore_index=True)
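Since DataFrame.append was removed in pandas 2.0, here is a sketch of the same loop for newer pandas, collecting the per-file frames in a list and concatenating once at the end (parseGFfile is the question's own helper; everything else is standard pandas):

import glob
import pandas as pd

frames = []
for file in glob.glob(t0_path):
    raw_data = parseGFfile(file)  # the question's parser
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file
    frames.append(file_data)

# a single concat at the end is also faster than appending inside the loop
t0_data = pd.concat(frames, ignore_index=True)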

Python: Add rows with different column names to dict/dataframe

I want to add data (dictionaries) to a dictionary, where every added dictionary represents a new row. It is an iterative process, and it is not known which column names a newly added dictionary (row) could have. In the end I want a pandas dataframe. Furthermore, I have to write the dataframe to a file every 1500 rows (which is a problem, because after 1500 rows it could of course happen that new data is added whose columns are not present in the 1500 rows already written to the file).
I need an approach that is very fast (maybe 26 ms per row). My approach is slow, because it has to check every record for new column names, and in the end it has to reread the file to create a new file in which all columns have the same length. The data comes from a queue which is processed in another process.
import pandas as pd

def writingData(writingQueue, exportFullName='path', buffer=1500, maxFiles=150000):
    # note: writingQueue has to come before the parameters with defaults
    imagesPassed = 0
    with open(exportFullName, 'a') as f:
        columnNamesAllList = []
        columnNamesAllSet = set()
        dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
        columnNamesUpdated = False
        for data in iter(writingQueue.get, "STOP"):
            print(imagesPassed)
            dfTemp = pd.DataFrame([data], index=[imagesPassed])
            # check whether this row introduces columns we haven't seen yet
            if set(dfTemp).difference(columnNamesAllSet):
                columnNamesAllSet.update(set(dfTemp))
                columnNamesAllList.extend(list(dfTemp))
                columnNamesUpdated = True
            else:
                columnNamesUpdated = False
            if columnNamesUpdated:
                print('Updated')
                dfTempAll = dfTemp.combine_first(dfTempAll)
            else:
                dfTempAll.iloc[imagesPassed - 1] = dfTemp.iloc[0]
            imagesPassed += 1
            if imagesPassed == buffer:
                dfTempAll.dropna(how='all', inplace=True)
                dfTempAll.to_csv(f, sep='\t', header=True)
                dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
                imagesPassed = 0
Reading it in again:
dfTempAll = pd.DataFrame(index=range(maxFiles), columns=columnNamesAllList)
for number, chunk in enumerate(pd.read_csv(exportFullName, delimiter='\t', chunksize=buffer,
                                           low_memory=True, memory_map=True, engine='c')):
    # align each chunk to the full column list before writing it back
    dfTempAll.iloc[number * buffer:(number + 1) * buffer] = chunk.reindex(columns=columnNamesAllList).values
dfTempAll.reset_index(drop=True, inplace=True)
dfTempAll.to_csv(exportFullName, sep='\t', header=True)
Small example with dataframes
So to make it clear: let's say I have an already existing 4-row dataframe (in the real case it could have 150,000 rows, like in the code above) where 2 rows are already filled with data, and I add a new row. It could look like this, with the exception that in the raw input the new data is a dictionary:
df1 = pd.DataFrame(index=range(4), columns=['A', 'B', 'D'],
                   data={'A': [1, 2, 'NaN', 'NaN'], 'B': [3, 4, 'NaN', 'NaN'], 'D': [3, 4, 'NaN', 'NaN']})
df2 = pd.DataFrame(index=[2], columns=['A', 'C', 'B'], data={'A': [0], 'B': [0], 'C': [0]})
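For reference, combine_first (which the buffered loop above relies on) aligns on both index and columns, so folding df2 into df1 produces the union of columns. A minimal sketch, continuing from the two frames just defined:

# continuing from df1 and df2 above (import pandas as pd assumed);
# df2's values win wherever they exist, everything else falls back to df1,
# and the result has the union of columns A, B, C, D
merged = df2.combine_first(df1)
print(merged)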

Extract nested JSON embedded as string in Pandas dataframe

I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.
My CSV:
id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
Note that not all keys are the same for all the rows.
I'd like it to produce a data frame equivalent to this:
data = {'id': [1, 2, 3],
        'dist': [67, 34, 98],
        'loc_lat': [45.7, 46.89, 45.70],
        'loc_lon': [38.9, 36.7, 31.0],
        'arrival': ["Monday", "Tuesday", "NA"],
        'characteristics_body_color': ["red", "blue", "yellow"],
        'characteristics_body_make': ["sedan", "sedan", "NA"],
        'characteristics_manuf_year': [2014, 2014, 2010]}
df = pd.DataFrame(data)
(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )
What I've tried
After a lot of futzing around, I came up with the following solution:
import json

import pandas as pd
from pandas.io.json import json_normalize

# Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")

# Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm

# Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])

# Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))

# Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)
This is obviously inelegant and really inefficient (would take multiple hours on the 300K rows that I have to parse). Is there a better way?
Where I've looked
I've gone through the following related questions:
Reading a CSV into pandas where one column is a json string
(which seems to only work for simple, non-nested JSON)
JSON to pandas DataFrame
(I borrowed parts of my solutions from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)
I'm using Python 3.3 and Pandas 0.17.
Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:
import csv
import json

import pandas as pd
from pandas.io.json import json_normalize

with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)

    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
So that:
>>> print(df)
   arrival characteristics.body.color characteristics.body.make  \
0   Monday                        red                     sedan
1  Tuesday                       blue                     sedan
2      NaN                     yellow                       NaN

   characteristics.manuf_year  loc.lat  loc.lon ids
0                        2014    45.70     38.9   1
1                        2014    46.89     36.7   2
2                        2010    45.70     31.0   3
Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:
def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary."""
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res

def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec

def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
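Usage, given the sample.csv from the question (and the same csv, json and pandas imports as above), is then just:

df = from_records('sample.csv')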
And here are the results of a timing experiment where I artificially increased the size of your sample data by repeating rows. The number of lines is denoted by n_rows:
        method 1 (s)  method 2 (s)  original time (s)
n_rows
96          0.008217      0.002971           0.362257
192         0.014484      0.004720           0.678590
384         0.027308      0.008720           1.373918
768         0.055644      0.016175           2.791400
1536        0.105730      0.030914           5.727828
3072        0.209049      0.060105          11.877403
Extrapolating linearly from the 3072-row timings (0.209 s and 0.060 s, i.e. roughly 0.07 ms and 0.02 ms per row), the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.

Import CSV file to List, use file name as identifier

Thank you for helping me clarify my question as well. Two sets of code below.
The first retrieves data from an online data source, adds the stock symbol as an identifier ("AA" in the output example below), and creates a list with the downloaded data; it works perfectly.
from datetime import datetime

from pandas_datareader.data import DataReader  # pandas.io.data in older pandas

stocks = ['AA', 'AAPL', 'IBM']
start = datetime(1990, 1, 1)
end = datetime.today()

data = {}
for stock in stocks:
    print(stock)
    stkd = DataReader(stock, 'yahoo', start, end).sort_index()
    data[stock] = stkd
Output:
{'AA':              OPEN   HIGH    LOW  CLOSE   VOLUME
Date
1990-01-02         75.00  75.62  74.25  75.62  4039200
1990-01-03         76.00  76.75  76.00  76.75  7332000
The second reads CSV files and creates a list, which works just fine; the goal is to add an identifier (using the CSV file name), similar to the code above, as the data is imported and the list is created.
Code for CSV read.
path = r'C:\Users\Data'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list = []
for file in allFiles:
    df = pd.read_csv(file, index_col=0)
    list.append(df)
frame = pd.concat(list)
Current Output:
[ Time Open High Low Close Vol OI
Date
12/17/1984 11:15 817.75 820.25 817.00 820.25 73445 309260
12/18/1984 11:15 820.25 821.00 818.50 819.25 87505 308240
Desired Output:
{'XX': Time Open High Low Close Vol OI
Date
12/17/1984 11:15 817.75 820.25 817.00 820.25 73445 309260
12/18/1984 11:15 820.25 821.00 818.50 819.25 87505 308240
I would like to read XX.csv, make XX the identifier for the incoming values, and then repeat the process with YY.csv, GG.csv, etc., into one combined list or panel.
I have tried several things without much luck; I'm new to Python but have gotten along fairly well thanks to Stack Overflow and similar sites.
CSV file format
Date,Time,Open,High,Low,Close,Vol,OI
12/17/1984,11:15,817.75,820.25,817,820.25,73445,309260
12/18/1984,11:15,820.25,821,818.5,819.25,87505,308240
The output you are seeing in the first case is a dictionary with string keys and DataFrame values. A minimal example:
import pandas

data = {}
for key in ['A', 'B', 'C']:
    data[key] = pandas.DataFrame({'Column': [1]})
print(data)
Output:
{'A': Column
0 1, 'B': Column
0 1, 'C': Column
0 1}
In your second case you are using a list. Just for future reference, you shouldn't use the name list, as it shadows the built-in list constructor.
The example I had before can be redone with lists:
data = []
for key in ['A', 'B', 'C']:
    data.append(pandas.DataFrame({'Column': [1]}))
print(data)
Output:
[ Column
0 1, Column
0 1, Column
0 1]
So, to match your first case, you should use a dictionary for your CSV files rather than a list, something like this:
data = {}  # note I've changed list to data, and used {} instead of []
for file in allFiles:
    df = pd.read_csv(file, index_col=0)
    data[file] = df  # Here I've changed the way of adding the data
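If you then want the single combined object the question asks for ("one combined list or panel"), one option is pd.concat with a dict, which turns the keys into the outer level of a MultiIndex. A sketch of that idea (my own addition; the os.path handling is just one way to reduce a path like 'C:\Users\Data\XX.csv' to 'XX'):

import glob
import os

import pandas as pd

data = {}
for file in glob.glob(r'C:\Users\Data' + '/*.csv'):
    # use the bare file name (e.g. 'XX') as the identifier
    key = os.path.splitext(os.path.basename(file))[0]
    data[key] = pd.read_csv(file, index_col=0)

# the dict keys become the outer level of a MultiIndex
frame = pd.concat(data)
print(frame)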
You could just add the CSV file name after the list has been created.
After the for loop and before you define frame, replace the left bracket ('[') with whatever you want.
So... (note line 4)
for file in allFiles:
    df = pd.read_csv(file, index_col=0)
    list.append(df)
list[0].replace("[", "{'XX':")
frame = pd.concat(list)
I'm not saying this is the cleanest way, but it will work. I'm also not sure what you meant about changing '[' to '{', but that change isn't required (you certainly can make it, though).
Good luck!
