Iteratively read a (tsv) file into a pandas DataFrame - Python
I have some experimental data which looks like this - http://paste2.org/YzJL4e1b (too long to post here). The blocks which are separated by field-name lines are different trials of the same experiment. I would like to read everything into a pandas DataFrame but have it bin certain trials together (for instance 0, 1, 6, 7 as one group and 2, 3, 4, 5 as another). This is because different trials have slightly different conditions and I would like to analyze how the results differ between these conditions. I have a list of numbers for the different conditions from another file.
Currently I am doing this:
tracker_data = pd.DataFrame
tracker_data = tracker_data.from_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)
tracker_data['GazePointXLeft'] = tracker_data['GazePointXLeft'].astype(np.float64)
but this of course just reads everything in one go (including the field-name lines) - it would be great if I could nest the blocks somehow, which would allow me to easily access them via numeric indices...
Do you have any ideas how I could best do this?
You should use read_csv rather than from_csv*:
tracker_data = pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)
If you want to join a list of DataFrames like this you could use concat:
trackers = (pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4) for i in range(?))
df = pd.concat(trackers)
* which I think is deprecated.
I haven't quite got it working, but I think that's because of how I copy/pasted the data. Try this, let me know if it doesn't work.
Using some inspiration from this question
pat = "TimeStamp\tGazePointXLeft\tGazePointYLeft\tValidityLeft\tGazePointXRight\tGazePointYRight\tValidityRight\tGazePointX\tGazePointY\tEvent\n"
with open('rec.txt') as infile:
header, names, tail = infile.read().partition(pat)
names = names.split() # get rid of the tabs here
all_data = tail.split(pat)
res = [pd.read_csv(StringIO(x), sep='\t', names=names) for x in all_data]
We read in the whole file (so this won't work for huge files) and then partition it based on the known line giving the column names. tail is just a string with the rest of the data, so we can split that, again based on the names. There may be a better way than using StringIO, but this should work.
I'm not sure how you want to join the separate blocks together, but this leaves them as a list. You can concat from there however you desire.
For larger files you might want to write a generator to read until you hit the column names and write a new file until you hit them again. Then read those in separately using something like Andy's answer.
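A rough sketch of that generator idea (my addition; the function name and details are illustrative), reusing the pat string from above and yielding one DataFrame per block instead of writing intermediate files:

import pandas as pd
from io import StringIO

def iter_blocks(path, header_line):
    # Yield one DataFrame per block, reading the file line by line;
    # everything before the first header line is skipped.
    names = header_line.split()
    buf = []
    seen_header = False
    with open(path) as infile:
        for line in infile:
            if line.strip() == header_line.strip():
                if buf:  # a block just ended, parse it
                    yield pd.read_csv(StringIO(''.join(buf)), sep='\t', names=names)
                    buf = []
                seen_header = True
            elif seen_header:
                buf.append(line)
    if buf:  # last block
        yield pd.read_csv(StringIO(''.join(buf)), sep='\t', names=names)

blocks = list(iter_blocks('rec.txt', pat))

This keeps only one block's worth of text in memory at a time, which is the main point over the read-everything-then-partition approach.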
A separate question is how to work with the multiple blocks. Assuming you've got the list of DataFrames, which I've called res, you can use pandas' concat to join them together into a single DataFrame with a MultiIndex (also see the link Andy posted).
In [122]: df = pd.concat(res, axis=1, keys=['a', 'b', 'c']) # Use whatever makes sense for the keys
In [123]: df.xs('TimeStamp', level=1, axis=1)
Out[123]:
a b c
0 NaN NaN NaN
1 0.0 0.0 0.0
2 3.3 3.3 3.3
3 6.6 6.6 6.6
I ended up doing it iteratively. Very, very iteratively. Nothing else seemed to work.
pat = 'TimeStamp\tGazePointXLeft\tGazePointYLeft\tValidityLeft\tGazePointXRight\tGazePointYRight\tValidityRight\tGazePointX\tGazePointY\tEvent'

with open(bhpath+fileid+'_wmet.tsv') as infile:
    eye_data = infile.read().split(pat)

eye_data = [trial.split('\r\n') for trial in eye_data]  # split each trial block into rows
for idx, trial in enumerate(eye_data):
    trial = [row.split('\t') for row in trial]  # split each row into its tab-separated fields
    eye_data[idx] = trial
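From there, one possible way to bin the trials by condition (my sketch, not part of the original answer): the groups below are just the example bins from the question, eye_data[0] is assumed to hold everything before the first header line, and rows with a wrong field count are dropped.

condition_a = [0, 1, 6, 7]  # hypothetical condition groups from the question
condition_b = [2, 3, 4, 5]

columns = pat.split()
# skip eye_data[0] (the file preamble) and keep only rows with one field per column
trial_dfs = [pd.DataFrame([row for row in trial if len(row) == len(columns)],
                          columns=columns)
             for trial in eye_data[1:]]

group_a = pd.concat([trial_dfs[i] for i in condition_a], keys=condition_a)
group_b = pd.concat([trial_dfs[i] for i in condition_b], keys=condition_b)

Each group is then a single DataFrame whose outer index level is the trial number, so the per-condition analysis can be done with ordinary groupby operations.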
Related
Fixing broken naming after merging a groupby pivot_table dataframe
I have a problem with the naming of columns in a DataFrame that results from merging it with an aggregation of itself created by groupby. The code that creates the mess looks like this:

volume_aggrao = volume.groupby(by=['room_name', 'material', 'RAO']).sum()['quantity']
volume_aggrao_concat = pd.pivot_table(pd.DataFrame(volume_aggrao), index=['room_name', 'material'], columns=['RAO'], values=['quantity'])
volume = volume.merge(volume_aggrao_concat, how='left', on=['room_name', 'material'])

Now to what it does: the goal of the pivot_table is to show the sum of the 'quantity' variable over each category of 'RAO', and it does that. It is fine until you look at how it is stored on the inside:

"('room_name', '')","('material', '')","('quantity', 'moi')","('quantity', 'nao')","('quantity', 'onrao')","('quantity', 'prom')","('quantity', 'sao')"
1,aluminum,NaN,13.0,NaN,NaN,NaN
1,concrete,151.0,NaN,NaN,NaN,NaN
1,plastic,56.0,NaN,NaN,NaN,NaN
1,steel_mark_1,NaN,30.0,2.0,NaN,1.0
1,steel_mark_2,52.0,NaN,88.0,NaN,NaN
2,aluminum,123.0,NaN,84.0,NaN,NaN
2,concrete,155.0,NaN,NaN,30.0,NaN
2,plastic,170.0,NaN,NaN,NaN,NaN
2,steel_mark_1,107.0,NaN,105.0,47.0,NaN
2,steel_mark_2,81.0,41.0,NaN,NaN,NaN
3,aluminum,NaN,NaN,90.0,NaN,79.0
3,concrete,NaN,82.0,NaN,NaN,NaN
3,plastic,1.0,NaN,25.0,NaN,NaN
3,steel_mark_1,116.0,10.0,NaN,136.0,NaN
3,steel_mark_2,NaN,92.0,34.0,NaN,NaN
4,aluminum,50.0,74.0,NaN,NaN,88.0
4,concrete,96.0,NaN,27.0,NaN,NaN
4,plastic,63.0,135.0,NaN,NaN,NaN
4,steel_mark_1,97.0,NaN,28.0,87.0,NaN
4,steel_mark_2,57.0,22.0,7.0,NaN,NaN

Nevertheless, I was still able to merge it, with the resulting columns being named automatically. I cannot seem to be able to refer to these '(quantity, smth)' columns and hence could not even rename them directly. So I decided to fully reset the column names with

volume.columns = ["id", "room_name", "material", "alpha_UA", "beta_UA", "alpha_F", "beta_F", "gamma_EP", "quantity", "files_id", "all_UA", "RAO", "moi", "nao", "onrao", "prom", "sao"]

which is indeed bulky, but it worked. Except it does not work when one or more of the categorical values of "RAO" is missing. For example, if there is no "nao" in "RAO", no such column is created and the code has nothing to rename. I tried fixing it with

volume.rename(lambda x: x.lstrip("(\'quantity\',").strip("\'() \'") if "(" in x else x, axis=1)

but it seems to do nothing to them. I want to know if there is a way to rename these columns.

Data

Here is some example data for the 'volume' DataFrame you can use to replicate the process, with the desired output embedded in it to compare:

"id","room_name","RAO","moi","nao","onrao","prom","sao"
"1","3","onrao","1","","25","",""
"2","4","nao","57","22","7","",""
"4","2","moi","170","","","",""
"6","4","moi","97","","28","87",""
"7","4","moi","97","","28","87",""
"11","1","nao","","13","","",""
"12","4","onrao","97","","28","87",""
"13","2","moi","107","","105","47",""
"18","2","moi","123","","84","",""
"19","2","moi","155","","","30",""
"22","2","moi","170","","","",""
"23","4","sao","50","74","","","88"
"24","4","nao","50","74","","","88"
So, after a cup of coffee and a cold shower, I was able to investigate a bit further and found out that the strange names are actually tuples, not strings! Knowing that, I decided to iterate over the columns, convert each to a string, and then apply the stripping. A bit bulky once again, but here is a solution:

names = []
for name in volume.columns:
    names.append(str(name).lstrip("(\'quantity\',").strip("\'() \'"))
volume.columns = names
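An alternative sketch (my addition, not from the original answer), assuming the odd names really are tuples like ('quantity', 'moi') and ('room_name', ''): take the second element when it is non-empty, otherwise the first, and leave plain string names untouched. This avoids the string stripping entirely.

# flatten tuple column names such as ('quantity', 'moi') -> 'moi'
# and ('room_name', '') -> 'room_name'; plain strings are kept as-is
volume.columns = [(col[1] or col[0]) if isinstance(col, tuple) else col
                  for col in volume.columns]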
RealTime data appending - Pandas
I am trying to do something very basic in pandas and failing miserably. At a high level, I am taking ask_size data from my broker, who passes the value to me on every tick update. I can print out the last value easily enough. All I am trying to do is append the next ask_size amount after the previous ask_size, at the end of a df in a new row, so I can do some historical analysis.

def getTickSize():
    askSize_list = []  # empty list
    askSize_list.append(float(ask_size))  # getting askSize and putting it in a list
    datagrab = {'ask_size': askSize_list}  # creating the single column and putting askSize in
    df = pd.DataFrame(datagrab)  # using a pd df
    print(df.tail(10))

I am then calling the function in a different part of my script. However, the output always only shows the last askSize:

   askSize
0     30.0

and never actually appends the real-time data. Clearly I am doing something wrong, but I am at a loss as to what. I have also tried using ignore_index=True in a second df, referencing the first, but no joy:

   askSize
0     30.0
1     30.0

I have also tried using 'for' loops, but as there doesn't seem to be anything to iterate over (the data is real-time) I came to a dead end. (Note I will also eventually add a timestamp to each new ask_size as it is appended to the list, so only 2 columns in the end.) Any help is much appreciated.
It seems you are creating a new DataFrame each time, not appending new data. You could, for example, create a new DataFrame that gets appended to the existing one, with the row(s) in the same format.

Let's say you already have df created and you want to add one new entry that is read as a parameter (if you need more, specify more parameters). Here is a basic example, starting from:

'askSize'
1.0
2.0

def append_row(newdata, dataframe):
    row = {'ask_size': [newdata]}
    temp_df = pd.DataFrame(row)
    # merge original dataframe with temp_df
    merged_df = pd.concat([dataframe, temp_df])
    return merged_df

df = append_row("5.1", df)  # this will overwrite your original df

which gives:

'askSize'
1.0
2.0
5.1

You would need to call the function to add a new row (for instance calling it from inside a loop or any other part of the code). You can also use df.append() and other methods; here are some links that could be useful for your use case:

Merge, join, concatenate and compare (pandas.pydata.org)
Example of using pd.append() (pandas.pydata.org)
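One caveat worth noting (my addition, not part of the original answer): concatenating on every tick copies the whole frame each time, which gets slow as the history grows. A sketch of a cheaper pattern, with hypothetical names, is to buffer plain rows and only build the DataFrame when you need it for analysis:

import pandas as pd

tick_rows = []  # hypothetical buffer of raw ticks

def on_tick(ask_size, timestamp):
    # store one plain dict per tick; appending to a list is cheap
    tick_rows.append({'timestamp': timestamp, 'ask_size': float(ask_size)})

def history():
    # build the DataFrame only when it is actually needed
    return pd.DataFrame(tick_rows)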
Setting category and type for multiple columns possible?
I have a dataset which contains 6 columns TIME1 to TIME6, amongst others. For each of these I need to apply the code below (shown here for 2 columns). LISTED is a prepared list of the possible elements to be seen in these columns. Is there a way to do this without writing the same 2 lines 6 times?

df['PART1'] = df['TIME1'].astype('category')
df['PART1'].cat.set_categories(LISTED, inplace=True)

df['PART2'] = df['TIME2'].astype('category')
df['PART2'].cat.set_categories(LISTED, inplace=True)

For astype (the first line of code), I tried the following:

for col in ['TIME1', 'TIME2', 'TIME3', 'TIME4', 'TIME5', 'TIME6']:
    df_col = df[col].astype('category')

I think this works (not sure how to check without the whole code working), but how could I do something similar for the second line of code with set_categories etc.? In short, I'm looking for something shorter/more elegant than just copying and modifying the same 2 lines 6 times. I am new to Python; any help is greatly appreciated. Using Python 2.7 and pandas 0.24.2.
Yes, it is possible! We can change the dtype of multiple columns to categorical in one go by creating a CategoricalDtype:

i = pd.RangeIndex(1, 7).astype(str)
df['PART' + i] = df['TIME' + i].astype(pd.CategoricalDtype(LISTED))
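If the vectorized version is hard to read, a plain loop (my sketch, not from the original answer) does the same thing and also works on pandas 0.24 / Python 2.7; LISTED stands for your prepared category list:

cat_type = pd.CategoricalDtype(categories=LISTED)
for n in range(1, 7):
    # create PARTn as a categorical copy of TIMEn with the fixed category list
    df['PART{}'.format(n)] = df['TIME{}'.format(n)].astype(cat_type)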
How to sort a csv file by column
I need to sort a .csv file in a very specific way but have pretty limited knowledge of Python. I have got some code that works, but it doesn't really do exactly what I want it to do. The format is as follows:

{header} {header} {header} {header}
{dataA} {dataB} {dataC} {dataD}

In the csv, whatever dataA is, it is usually repeated 100-200 times. Is there a way I can take dataA (e.g. examplecompany), find how many times it repeats, and then how many times each dataC value appears with that dataA as the first item in the row? For example, the output might be: examplecompany appeared 100 times; out of those 100, datac1 appeared 45 times and datac2 appeared 55. I'm really terrible at explaining things; any help would be appreciated.
You can use csv.DictReader to read the file and then sort by the key you want:

from csv import DictReader

with open("test.csv") as f:
    reader = DictReader(f)
    sorted_rows = sorted(list(reader), key=lambda x: x["column1"])

CSV file I tested it with (test.csv):

column1,column2
2,bla
1,blubb
It is not clear what you want to accomplish, since you have not provided any code or a complete example of input/output for your problem. To me, it seems that you want to count certain occurrences of data in headerC for each unique value in headerA. Suppose you have the following .csv file:

headerA,headerB,headerC,headerD
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany2,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac3,datad

You can accomplish this counting with pandas. The following is an example of how you might do it:

>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df.groupby(['headerA'])['headerC'].value_counts()
headerA          headerC
examplecompany1  datac1     3
                 datac2     2
                 datac3     1
examplecompany2  datac2     2
                 datac1     1
Name: headerC, dtype: int64

Here, groupby groups the DataFrame using headerA as a reference. You can group by a single Series or a list of Series. After that, the square bracket notation is used to access the headerC column, and value_counts counts each occurrence of headerC that was previously grouped by headerA. Afterwards you can just format the output however you want.

Edit: I forgot that you also wanted the number of occurrences of headerA, but that is really simple since you can get it directly by selecting the headerA column of the DataFrame df and calling value_counts on it.
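For completeness (my addition), with the same sample file that last step would look roughly like this:

>>> df['headerA'].value_counts()
examplecompany1    6
examplecompany2    3
Name: headerA, dtype: int64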
How to stream in and manipulate a large data file in python
I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:

Geography  AgeGroup  Gender  Race  Count
County1    1         M       1     12
County1    2         M       1     3
County1    2         M       2     0

To:

Geography  Count
County1    15
County2    23

This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives MemoryError. So I have been looking into other methods, and there appear to be many options - HDF5? Using itertools (which seems complicated - generators?) Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write it out before loading in another 70 lines.

Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, particularly because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.

Edit: In this small case I only want the sums of Count by Geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
You can use dask.dataframe, which is syntactically similar to pandas but performs manipulations out-of-core, so memory shouldn't be an issue:

import dask.dataframe as dd

df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')

Alternatively, if pandas is a requirement you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter:

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
I do like @root's solution, but I would go a bit further with optimizing memory usage - keeping only the aggregated DF in memory and reading only those columns that you really need:

cols = ['Geography', 'Count']
df = pd.DataFrame()

chunksize = 2   # adjust it! for example --> 10**5
for chunk in pd.read_csv(filename, usecols=cols, chunksize=chunksize):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df, chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame()
         )

df.reset_index().to_csv('c:/temp/result.csv', index=False)

test data:

Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111

output.csv:

Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111

PS: using this approach you can process huge files.

PPS: the chunking approach should work unless you need to sort your data - in that case I would use classic UNIX tools like awk, sort, etc. to sort the data first.

I would also recommend using PyTables (HDF5 storage) instead of CSV files - it is very fast and allows you to read data conditionally (using the where parameter), so it's very handy, saves a lot of resources, and is usually much faster compared to CSV.
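To make that last suggestion concrete, here is a small sketch of the HDF5 route (my addition; the file name, key, and column sizes are placeholders, and it requires PyTables to be installed): convert the CSV to an HDF5 table once, then query it with where:

import pandas as pd

# one-off conversion: stream the CSV into an HDF5 table in chunks
with pd.HDFStore('data.h5') as store:
    for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
        store.append('counts', chunk,
                     data_columns=['Geography', 'Count'],  # make these queryable
                     min_itemsize={'Geography': 30})       # reserve room for the longest name

# later: read only the rows you need instead of the whole file
subset = pd.read_hdf('data.h5', 'counts', where="Geography == 'County1'")
print(subset['Count'].sum())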