Pandas Regex: Read specific columns only from csv with regex patterns - python

Given a large CSV file (large enough to exceed RAM), I want to read only specific columns that follow certain patterns. The columns can be any of the following: S_0, S_1, ... D_1, D_2, etc. For example, a chunk from the data frame looks like this:
And the regex pattern would be, for example, any column that starts with S: S_\d.*.
Now, how do I apply this with pd.read_csv(/path/, __) to read the specific columns as mentioned?

You can first read a few rows and use DataFrame.filter to get the matching columns:
cols = pd.read_csv('path', nrows=10).filter(regex=r'S_\d*').columns
df = pd.read_csv('path', usecols=cols)

Took the same approach (as of now) as mentioned in the comments. Here is the detailed piece I used:
import re
def extract_col_names(all_cols, pattern):
    result = []
    for col in all_cols:
        if re.match(pattern, col):
            result.append(col)
    return result
extract_col_names(cols, pattern=r"S_\d+")
And it works!
But even with this workaround, loading the column names first is an extra pass that can itself be heavy. So, does there exist any method to apply regex patterns at the time of reading CSVs? This still remains a question.
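For what it's worth, pd.read_csv also accepts a callable for usecols, which is evaluated against each column name while parsing, so the regex check can be folded into the single read. A minimal sketch (the file path is a placeholder, not from the question):
import re
import pandas as pd
pattern = re.compile(r"S_\d+")  # keep only columns matching the pattern
df = pd.read_csv("/path/to/data.csv", usecols=lambda name: bool(pattern.match(name)))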
Thanks for the response :)

Related

Python loop to search multiple sets of keywords in all columns of dataframe

I've used the code below to search across all columns of my dataframe to see if each row has the word "pool" and the words "slide" or "waterslide".
AR11_regex = r"""
(?=.*(?:slide|waterslide)).*pool
"""
f = lambda x: x.str.findall(AR11_regex, flags=re.VERBOSE|re.IGNORECASE)
d['AR']['AR11'] = d['AR'].astype(str).apply(f).any(1).astype(int)
This has worked fine, but when I write a for loop to do this for more than one regex pattern (e.g., AR11, AR12, AR21) using the code below, the new columns are all zeros (i.e., the search is not finding any hits).
for i in AR_list:
    print(i)
    pat = i + "_regex"
    print(pat)
    f = lambda x: x.str.findall(i + "_regex", flags=re.VERBOSE|re.IGNORECASE)
    d['AR'][str(i)] = d['AR'].astype(str).apply(f).any(1).astype(int)
Any advice on why this loop didn't work would be much appreciated!
A small sample data frame would help understand your question. In any case, your code sample appears to have a multitude of problems.
i+"_regex" is just the string "AR11_regex". It won't evaluate to the value of the variable with the identifier AR11_regex. Put your regex patterns in a dict.
d['AR'] is the values in the AR column. It seems like you expect it to be a row.
d['AR'][str(i)] is adding a new row. It seems like you want to add a new column.
Lastly, this approach to setting a cell generally (always for me) yields the following warning:
/var/folders/zj/pnrcbb6n01z2qv1gmsk70b_m0000gn/T/ipykernel_13985/876572204.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The suggested approach would be to use at, as in d.at[str(i), 'AR'] or some such.
Add a sample data frame and refine your question for more suggestions.
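For illustration, a minimal sketch of the dict-based approach suggested above (the data and pattern names are made up, and it uses str.contains for the boolean test instead of findall), assigning whole columns so the SettingWithCopyWarning is avoided:
import re
import pandas as pd
# Hypothetical data; each dict key becomes a new 0/1 column.
d = pd.DataFrame({"desc": ["big pool with waterslide", "no amenities"],
                  "notes": ["", "tennis court"]})
patterns = {
    "AR11": r"(?=.*(?:slide|waterslide)).*pool",
    "AR12": r"(?=.*swim).*pool",
}
for name, pat in patterns.items():
    # Test every column of every row for a match, then collapse to 0/1 per row.
    hits = d.astype(str).apply(lambda col: col.str.contains(pat, flags=re.IGNORECASE, regex=True))
    d[name] = hits.any(axis=1).astype(int)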

Extract regex matches, and not groups, in data frames rows in Python

I am a novice in coding and I generally use R for this (stringr), but I have started to learn Python's syntax as well.
I have a data frame with one column generated from an imported Excel file. The values in this column contain uppercase and lowercase characters, symbols, and numbers.
I would like to generate a second column in the data frame containing only some of these words included in the first column according to a regex pattern.
df = pd.DataFrame(["THIS IS A TEST 123123. s.m.", "THIS IS A Test test 123 .s.c.e", "TESTING T'TEST 123 da."],columns=['Test'])
df
Now, to extract what I want (words in capital case), in R I would generally use:
df <- str_extract_all(df$Test, "\\b[A-Z]{1,}\\b", simplify = FALSE)
to extract the matches of the regular expression in different data frame rows, which are:
* THIS IS A TEST
* THIS IS A
* TESTING T TEST
I couldn't find a similar solution for Python, and the closest I've gotten is the following:
df["Name"] = df["Test"].str.extract(r"(\b[A-Z]{1,}\b)", expand = True)
Unfortunately this does not work, as it exports only the groups rather than the matches of the regex. I've tried multiple strategies, but str.extractall does not seem to work either ("TypeError: incompatible index of inserted column with frame index").
How can I extract the information I want with Python?
Thanks!
If I understand well, you can try:
df["Name"] = df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)")
.unstack().fillna('').apply(' '.join, 1)
[EDIT]:
Here is a shorter version I discovered by looking at the doc:
df["Name"] = df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)").unstack(fill_value='').apply(' '.join, 1)
You are on the right track with the pattern. This solution uses a regular expression, join, and map.
import re
df['Name'] = df['Test'].map(lambda x: ' '.join(re.findall(r"\b[A-Z\s]+\b", x)))
Result:
Test Name
0 THIS IS A TEST 123123. s.m. THIS IS A TEST
1 THIS IS A Test test 123 .s.c.e THIS IS A
2 TESTING T'TEST 123 da. TESTING T TEST
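As a side note (not from either answer above, so treat it as a sketch that reuses the df defined in the question), the findall-and-join can also be done entirely with vectorized Series.str methods:
# Find every run of capital letters, then join each row's matches with spaces.
df["Name"] = df["Test"].str.findall(r"\b[A-Z]+\b").str.join(" ")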

How to sort a csv file by column

I need to sort a .csv file in a very specific way but have pretty limited knowledge of Python. I have some code that works, but it doesn't really do exactly what I want it to do. The format is as follows: {header} {header} {header} {header}
{dataA} {dataB} {dataC} {dataD}
In the csv, whatever dataA is, it is usually repeated 100-200 times. Is there a way I can take dataA (e.g. examplecompany), count how many times it repeats, and then count how many times each dataC value appears with that dataA as the first item in the row? For example, the output might be: examplecompany appeared 100 times, and out of those 100, datac1 appeared 45 times and datac2 appeared 55 times. I'm really terrible at explaining things, any help would be appreciated.
You can use csv.DictReader to read the file and then sort by the key you want.
from csv import DictReader
with open("test.csv") as f:
    reader = DictReader(f)
    sorted_rows = sorted(list(reader), key=lambda x: x["column1"])
CSV file I tested it with (test.csv):
column1,column2
2,bla
1,blubb
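If you then want to write the sorted rows back out, here is a minimal sketch using csv.DictWriter (the output file name is an assumption):
from csv import DictReader, DictWriter
with open("test.csv") as f:
    reader = DictReader(f)
    fieldnames = reader.fieldnames
    sorted_rows = sorted(reader, key=lambda x: x["column1"])
# Write the sorted rows to a new CSV with the same header.
with open("test_sorted.csv", "w", newline="") as f:
    writer = DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(sorted_rows)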
It is not clear what you want to accomplish, since you have not provided any code or a complete example of input/output for your problem.
For me, it seems that you want to count certain occurrences of data in headerC for each unique data in headerA.
Suppose you have the following .csv file:
headerA,headerB,headerC,headerD
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany2,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac3,datad
You can accomplish this counting with pandas. Following is an example of how you might do it.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df.groupby(['headerA'])['headerC'].value_counts()
headerA headerC
examplecompany1 datac1 3
datac2 2
datac3 1
examplecompany2 datac2 2
datac1 1
Name: headerC, dtype: int64
Here, groupby will group the DataFrame using headerA as a reference. You can group by a single Series or a list of Series. After that, the square bracket notation is used to access the headerC column and value_counts will count each occurrence of headerC that was previously grouped by headerA. Afterwards you can just format the output for what you want.
Edit:
I forgot that you also wanted to get the number of occurrences of headerA, but that is really simple since you can get it directly by selecting the headerA column on the DataFrame df and call value_counts on it.
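For instance, with the sample file above:
>>> df['headerA'].value_counts()
examplecompany1    6
examplecompany2    3
Name: headerA, dtype: int64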

Pandas dataframe to_csv - split into multiple output files

What is the best /easiest way to split a very large data frame (50GB) into multiple outputs (horizontally)?
I thought about doing something like:
stepsize = int(1e8)
for id, i in enumerate(range(0, df.size, stepsize)):
    start = i
    end = i + stepsize - 1  # neglect last row ...
    df.ix[start:end].to_csv('/data/bs_' + str(id) + '.csv.out')
But I bet there is a smarter solution out there?
As noted by jakevdp, HDF5 is a better way to store huge amounts of numerical data; however, it doesn't meet my business requirements.
This answer brought me to a satisfying solution using:
numpy.array_split(object, number_of_chunks)
for idx, chunk in enumerate(np.array_split(df, number_of_chunks)):
    chunk.to_csv(f'/data/bs_{idx}.csv')
Use the id in the filename, else it will not work. You missed the id, and without it, it gives an error.
for id, df_i in enumerate(np.array_split(df, number_of_chunks)):
    df_i.to_csv('/data/bs_{id}.csv'.format(id=id))

iteratively read (tsv) file for Pandas DataFrame

I have some experimental data which looks like this - http://paste2.org/YzJL4e1b (too long to post here). The blocks, which are separated by field name lines, are different trials of the same experiment - I would like to read everything into a pandas dataframe but have it bin certain trials together (for instance 0, 1, 6, 7 taken together, and 2, 3, 4, 5 taken together in another group). This is because different trials have slightly different conditions and I would like to analyze the difference in results between these conditions. I have a list of numbers for the different conditions from another file.
Currently I am doing this:
tracker_data = pd.DataFrame
tracker_data = tracker_data.from_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)
tracker_data['GazePointXLeft'] = tracker_data['GazePointXLeft'].astype(np.float64)
but this of course just reads everything in one go (including the field name lines) - it would be great if I could nest the blocks somehow which allows me to easily access them via numeric indices...
Do you have any ideas how I could best do this?
You should use read_csv rather than from_csv*:
tracker_data = pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)
If you want to join a list of DataFrames like this you could use concat:
trackers = (pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4) for i in range(?))
df = pd.concat(trackers)
* which I think is deprecated.
I haven't quite got it working, but I think that's because of how I copy/pasted the data. Try this, let me know if it doesn't work.
Using some inspiration from this question
pat = "TimeStamp\tGazePointXLeft\tGazePointYLeft\tValidityLeft\tGazePointXRight\tGazePointYRight\tValidityRight\tGazePointX\tGazePointY\tEvent\n"
with open('rec.txt') as infile:
header, names, tail = infile.read().partition(pat)
names = names.split() # get rid of the tabs here
all_data = tail.split(pat)
res = [pd.read_csv(StringIO(x), sep='\t', names=names) for x in all_data]
We read in the whole file so this won't work for huge files, and then partition it based on the known line giving the column names. tail is just a string with the rest of the data so we can split that, again based on the names. There may be a better way than using StringIO, but this should work.
I'm note sure how you want to join the separate blocks together, but this leaves them as a list. You can concat from there however you desire.
For larger files you might want to write a generator to read until you hit the column names and write a new file until you hit them again. Then read those in separately using something like Andy's answer.
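A rough sketch of that generator idea, for reference (the header prefix and file name are assumptions, not part of the original answer):
import io
import pandas as pd
def iter_blocks(path, header_prefix="TimeStamp\t"):
    """Yield one DataFrame per block, splitting whenever a header line reappears."""
    names, buf = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(header_prefix):
                if buf:  # flush the previous block
                    yield pd.read_csv(io.StringIO("".join(buf)), sep="\t", names=names)
                    buf = []
                names = line.split()
            elif names is not None:
                buf.append(line)
        if buf:  # flush the last block
            yield pd.read_csv(io.StringIO("".join(buf)), sep="\t", names=names)
# e.g. res = list(iter_blocks("rec.txt"))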
A separate question is how to work with the multiple blocks. Assuming you've got the list of DataFrames, which I've called res, you can use pandas' concat to join them together into a single DataFrame with a MultiIndex (also see the link Andy posted).
In [122]: df = pd.concat(res, axis=1, keys=['a', 'b', 'c']) # Use whatever makes sense for the keys
In [123]: df.xs('TimeStamp', level=1, axis=1)
Out[123]:
a b c
0 NaN NaN NaN
1 0.0 0.0 0.0
2 3.3 3.3 3.3
3 6.6 6.6 6.6
I ended up doing it iteratively. Very, very iteratively. Nothing else seemed to work.
pat = 'TimeStamp GazePointXLeft GazePointYLeft ValidityLeft GazePointXRight GazePointYRight ValidityRight GazePointX GazePointY Event'
with open(bhpath+fileid+'_wmet.tsv') as infile:
    eye_data = infile.read().split(pat)
eye_data = [trial.split('\r\n') for trial in eye_data]  # split at '\r\n'
for idx, trial in enumerate(eye_data):
    trial = [row.split('\t') for row in trial]
    eye_data[idx] = trial
