I have to find a list of strings in a .txt file.
The file has 200k+ lines.
This is my code:
with open(txtfile, 'rU') as csvfile:
    tp = pd.read_csv(csvfile, iterator=True, chunksize=6000, error_bad_lines=False,
                     header=None, skip_blank_lines=True, lineterminator="\n")
    for chunk in tp:
        if string_to_find in chunk:
            print("hurrà")
The problem is that with this code only the first 9k lines are analyzed.
Why?
Do you really need to open the file first and then use pandas? If it's an option, you can just read with pandas and then concatenate.
To do that, use read_csv, concat the chunks, then loop through the result.
import pandas as pd

df = pd.read_csv('data.csv', iterator=True, chunksize=6000, error_bad_lines=False,
                 header=None, skip_blank_lines=True)
df = pd.concat(df)
# start the for loop
Whether you need the loop depends on what you're doing in it; pandas most likely has a vectorized function for it, since looping row by row over large data is slow.
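As a rough sketch of what the chunked search could look like (the in-memory sample and the "text" column name are stand-ins for your real file; note that `string_to_find in chunk` only tests the column labels, so you have to test the values with `str.contains`):

```python
import io

import pandas as pd

# Stand-in for the real 200k-line file
sample = io.StringIO("first line\nhurra is here\nlast line\n")

string_to_find = "hurra"
found = False
reader = pd.read_csv(sample, header=None, names=["text"],
                     skip_blank_lines=True, chunksize=2)
for chunk in reader:
    # `in chunk` would only check column labels; test the cell values instead
    if chunk["text"].str.contains(string_to_find, regex=False).any():
        found = True
        break

print(found)
```

Breaking out early also means you don't pay for the remaining chunks once a match is found.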
Related
I am processing a large data set, at least 8 GB in size, using pandas.
I ran into a problem reading the whole set at once, so I read the file chunk by chunk.
In my understanding, chunking the whole file will create many different dataframes. So using my existing routine, this only removes the duplicate values on that certain dataframe and not the duplicates on the whole file.
I need to remove the duplicates on this whole data set based on the ['Unique Keys'] column.
I tried to use the pd.concat but I also encountered a problem with the memory so I tried to write the file on a csv file and append all the results of the dataframes on it.
After running the code, the file doesn't reduce much so I think my assumption is right that the current routine is not removing all the duplicates based on the whole data set.
I'm a newbie in Python so it would really help if someone can point me in the right direction.
def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                              low_memory=False)
    # new_df = pd.DataFrame()
    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
                       ' Step-3.csv'), mode='w', index=False, encoding='utf8')
If you can fit the set of unique keys in memory:

def removeduplicates(filename, output_filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False,
                              chunksize=CHUNK_SIZE,
                              low_memory=False)
    # create a set of (unique) ids
    all_ids = set()
    for i, df in enumerate(df_iterator):
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        # Drop rows whose key was already seen in an earlier chunk
        df = df.loc[~df['Unique Keys'].isin(all_ids)]
        # Add the new keys to the set
        all_ids.update(df['Unique Keys'].unique())
        # Append the de-duplicated chunk; write the header only once
        df.to_csv(output_filename, mode='a', index=False, header=(i == 0))
Probably easier not to do it with pandas:

import csv

with open(input_csv_file, newline='') as fin, \
        open(output_csv_file, 'w', newline='') as fout:
    writer = csv.writer(fout)
    seen_keys = set()
    header = True
    for row in csv.reader(fin):
        if header:
            writer.writerow(row)
            header = False
            continue
        key = tuple(row[i] for i in key_indices)
        if not all(key):  # skip if key is empty
            continue
        if key not in seen_keys:
            writer.writerow(row)
            seen_keys.add(key)
I think this is a clear example of when you should use Dask or PySpark. Both allow you to read files that do not fit in your memory.
As an example with Dask you could do:
import dask.dataframe as dd
df = dd.read_csv(filename, na_filter=False)
df = df.dropna(subset=["Unique Keys"])
df = df.drop_duplicates(subset=["Unique Keys"])
df.to_csv(filename_out, index=False, encoding="utf8", single_file=True)
I'm currently sorting a CSV file. The output values are correct, but it isn't properly formatted. The following is the file I'm sorting,
and here is the output after I sort (I'll include the code after the image).
Obviously I'm having a delimiter issue, but here is my code:
with open(out_file, 'r') as unsort:  # Opens OMI data
    with open(Pandora_Sorted, 'w') as sort:  # Opens file to write to
        for line in unsort:
            if "Datetime" in line:  # Searches lines
                writer = csv.writer(sort, delimiter=',')
                writer.writerow(headers)  # Writes header
            elif "T13" in line:
                writer = csv.writer(sort)
                writer.writerow(line)  # Writes to output file
I think it's easier to read the csv file into a pandas DataFrame and sort that; please check the sample code below.
import pandas as pd
df = pd.read_csv(input_file)
df.sort_values(by = ['Datetime'], inplace = True)
df.to_csv(output_file)
Do you need to be explicit about your separator for the writer?
Here in the second line of your elif:
elif "T13" in line:
    writer = csv.writer(sort, delimiter=',')
    writer.writerow(line)  # Writes to output file
For the provided code, the header would also get formatting similar to the other rows due to the following line:
writer=csv.writer(sort, delimiter = ',')
Using pandas, the following can be used to sort in ascending order by a list of columns, list_of_columns:

import pandas as pd

input_csv = pd.read_csv(out_file, sep=',')
# sort_values returns a new DataFrame unless inplace=True, so assign it back
input_csv = input_csv.sort_values(by=list_of_columns, ascending=True)
input_csv.to_csv(Pandora_Sorted, sep=',', index=False)
For example, list_of_columns could be:
list_of_columns = ['Datetime', 'JulianDate', 'repetition']
I have a dirty csv with an ugly header that I have formatted and stored in a list.
I want to read this csv chunk by chunk, perform some regex on the data, and then write to a new csv.
I'm using this function to do so
def format_data(data_location, formatted_header):
    df = pd.read_csv(data_location, sep=',', skiprows=1,
                     header=0, names=formatted_header, chunksize=10000)
    for chunk in df:
        chunk = chunk.replace('(?!(([^"]*"){2})*[^"]*$),', '', regex=True)
        chunk.to_csv('formatted_data.csv', mode='a', index=False)
As I understand what I am doing here:
pd.read_csv(data_location, sep=',', skiprows=1,
header=0, names=formatted_header, chunksize=10000)
I am reading the csv from its location, skipping the first ugly header row and replacing it with my formatted_header.
My issue is that for each new chunk that is written to the new CSV, I am seeing the formatted header row repeated after every 10,000 rows. How can I prevent this from happening?
Since you only want to write the header once, use a boolean to see if you're on the first chunk.
For example:
write_header = True
for chunk in df:
    chunk = chunk.replace('(?!(([^"]*"){2})*[^"]*$),', '', regex=True)
    chunk.to_csv('formatted_data.csv', mode='a', index=False, header=write_header)
    write_header = False
I have a 1 GB, 70M-row file, and any time I load it all, it runs out of memory. I have read in 1000 rows and been able to prototype what I'd like it to do.
My problem is not knowing how to get the next 1000 rows, apply my logic, and continue running through the file until it finishes the last rows. I've read about chunksizing, but I can't figure out how to continue iterating over the chunks.
Ideally, it would flow like such:
1) read in the first 1000 rows
2) filter data based on criteria
3) write to csv
4) repeat until no more rows
Here's what I have so far:
import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000, iterator=True)
data = data[data['visits'] > 10]
with open('data.csv', 'a') as f:
    data.to_csv(f, sep=',', index=False, header=False)
There are some problems with your logic: we want to loop over each chunk of the data, not over the reader object itself.
The chunksize argument gives us a TextFileReader object that we can iterate over.
import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000)
for chunk in data:
    chunk = chunk[chunk['visits'] > 10]
    # append, otherwise each chunk overwrites the previous one
    chunk.to_csv('data.csv', mode='a', index=False, header=False)
You will need to think about how to handle your header!
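One way to handle the header (just a sketch; the in-memory buffers below stand in for datafile.txt and data.csv, and the visits column comes from the question) is to write it with the first chunk only:

```python
import io

import pandas as pd

# Stand-ins for datafile.txt and data.csv from the question
data = io.StringIO("visits\tname\n5\ta\n12\tb\n20\tc\n")
out = io.StringIO()

reader = pd.read_csv(data, sep='\t', chunksize=2)
for i, chunk in enumerate(reader):
    filtered = chunk[chunk['visits'] > 10]
    # header only for the first chunk; later chunks just append rows
    filtered.to_csv(out, index=False, header=(i == 0))

print(out.getvalue())
```

With real files you would combine this with mode='a', as in the snippet above.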
When you pass a chunksize or iterator=True, pd.read_table returns a TextFileReader that you can iterate over or call get_chunk on, so you need to iterate over data (or call get_chunk on it) rather than treating it as a DataFrame.
So proper handling of your entire file might look something like
import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000, iterator=True)
with open('data.csv', 'a') as f:
    for chunk in data:
        chunk[chunk.visits > 10].to_csv(f, sep=',', index=False, header=False)
I have been using pandas on csv files to get some values out of them. My data looks like this:
"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
I have a simple script to read the csv and create the frequencies of WORD by group so the output is like:
group freqW1 freqW2
A 1 0
B 1 0
C 0 1
Then I do some other operations on the values. The problem is that now I have to deal with very large csv files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but because a 'TextFileReader' object is not subscriptable, I can't do the necessary operations on the chunks.
I suspect there is some easy way to iterate through the csv and do what I want.
My code is like this:
from collections import Counter
import pandas as pd

df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])
freq = Counter(df['group'])
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
outfile = open("csv_out.txt", "w", encoding='utf-8')
df1.to_csv(outfile, sep=",")
outfile.close()
You can specify a chunksize option in the read_csv call. See here for details
Alternatively you could use the Python csv library and create your own csv Reader or DictReader and then use that to read in data in whatever chunk size you choose.
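A sketch of that csv-module route (the sample rows mirror the data above; the `counts` structure and the counting logic are illustrative assumptions — the point is that the reader streams row by row, so memory stays flat regardless of file size):

```python
import csv
import io
from collections import defaultdict

# Sample rows standing in for the 20+ GB csvfile.txt
sample = io.StringIO(
    '"A",23.495,41.995,"this is a sentence with some words"\n'
    '"B",52.243,0.118,"More text but contains WORD1"\n'
    '"A",119.142,-58.289,"Also contains WORD1"\n'
    '"C",12.413,18.494,"This string contains WORD2"\n'
)

counts = defaultdict(lambda: [0, 0])  # group -> [freqW1, freqW2]
for group, _val1, _val2, text in csv.reader(sample):
    if "WORD1" in text:
        counts[group][0] += 1
    if "WORD2" in text:
        counts[group][1] += 1

print(dict(counts))
```

For the real file you would pass `open("csvfile.txt", newline='')` instead of the StringIO buffer.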
Okay, I misunderstood the chunk parameter. I solved it by doing this:

frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"], chunksize=1000000)
for df in chunks:
    freq = Counter(df['group'])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
    frame = frame.add(df1, fill_value=0)

outfile = open("csv_out.txt", "w", encoding='utf-8')
frame.to_csv(outfile, sep=",")
outfile.close()