I have more than 15M tweets and I need to merge the ID and text columns after dropping duplicates. What is the most efficient way to do this? It is currently taking very long to complete.
import pandas as pd

frames = []
missed = 0
for q in query_list:
    hashtag = q + '.csv'
    try:
        file_data = pd.read_csv(path + hashtag, encoding='utf-8')
        frames.append(file_data)
    except FileNotFoundError:
        missed += 1
        continue
df = pd.concat(frames)
df = df[['id', 'text']]
df = df.drop_duplicates()
df.to_csv('row_tweets.csv', index=False)
If you want unique (id, text) pairs, I'd do it in pure Python, using a set for easy de-duplication and the csv module's readers/writers:
import csv

id_text_pairs = set()  # set of (id, text) pairs
missed = 0
for q in query_list:
    hashtag = q + '.csv'
    try:
        with open(path + hashtag, 'r', encoding='utf-8') as infile:
            reader = csv.DictReader(infile)
            for row in reader:
                id_text_pairs.add((row['id'], row['text']))  # a set won't add duplicates
    except FileNotFoundError:
        missed += 1
        continue

with open('row_tweets.csv', 'w', newline='') as outfile:
    col_names = ['id', 'text']
    writer = csv.DictWriter(outfile, fieldnames=col_names)
    writer.writeheader()  # first line is the 'id,text' header
    for tweet_id, text in id_text_pairs:
        writer.writerow({'id': tweet_id, 'text': text})  # write each id,text pair
That should do it, and I believe it will be more efficient at de-duping than one huge dataframe call at the end. Note that if your texts contain commas, you might want to output in tab-delimited format using the DictWriter argument delimiter='\t', or use the quotechar and quoting arguments; check out the csv module's documentation.
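For instance, a minimal Python 3 sketch (the output filename is reused from above, the sample row is made up) that quotes every field so embedded commas survive a round trip:

import csv

with open('row_tweets.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=['id', 'text'],
                            quoting=csv.QUOTE_ALL)  # wrap every field in quotes
    writer.writeheader()
    writer.writerow({'id': '123', 'text': 'hello, world'})  # comma is safely quoted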
Related
I am new to Python. I have one CSV file with more than 1000 rows, and I want to merge particular rows and move their values into another column. Can anyone help?
This is the source csv file I have:
I want to move the emails under the members column with a comma separator, like in this image:
To read csv files in Python, you can use the csv module. This code does the merging you're looking for.
import csv

output = []  # this will store a list of new rows
with open('test.csv') as f:
    reader = csv.reader(f)
    # read the first line of the input as the headers
    header = next(reader)
    output.append(header)
    # we will build up groups and their emails
    emails = []
    group = []
    for row in reader:
        if len(row) > 1 and row[1]:  # "UserGroup" is given
            if group:
                # close out the previous group: join its emails into its last column
                group[-1] = ','.join(emails)
            group = row
            output.append(group)
            emails = []
        else:  # it isn't, assume this is an email
            emails.append(row[0])
    # close out the final group
    group[-1] = ','.join(emails)

# now write a new file
with open('new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(output)
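To make the transformation concrete, here is a hypothetical input (the real layout was only shown as an image in the original post, so the column names are assumptions) and what the code above would write:

test.csv (input):
Name,UserGroup,Members
,GroupA,
alice@example.com
bob@example.com
,GroupB,
carol@example.com

new.csv (output):
Name,UserGroup,Members
,GroupA,"alice@example.com,bob@example.com"
,GroupB,carol@example.com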
I have two csv files simulating patient data that I need to read in and compare.
Without using Pandas, I need to sort the second file by Subject_ID and append the sex of each patient to the first csv file. I don't know where to start. Any ideas?
So far my plan is to somehow work with a dictionary to try to re-group the second file.
import csv

with open('Patient_Sex.csv', 'r') as file_sex, open('Patient_FBG.csv', 'r') as file_fbg:
    patient_reader = csv.DictReader(file_sex)
    fbg_reader = csv.DictReader(file_fbg)
After this, it gets really muddy for me.
I think this is what you are looking for, assuming you are working with .csv files based on the data that you posted.
Basically, you can parse each file into a list of dictionaries with csv.DictReader, and then the rows are easy to manipulate.
import csv

gender_data = []
full_data = []

with open("stack/new.csv", encoding="utf-8") as csvf:
    csvReader = csv.DictReader(csvf)
    for row in csvReader:
        gender_data.append(row)

with open("stack/info.csv", encoding="utf-8") as csvf:
    csvReader = csv.DictReader(csvf)
    for row in csvReader:
        full_data.append(row)

# copy each patient's sex onto the matching row of the main data
for x in gender_data:
    for y in full_data:
        if x["SUBJECT_ID"] == y["SUBJECT_ID"]:
            y["SEX"] = x["SEX"]

with open("stack/test.csv", "w", newline="") as outfile:
    f = csv.writer(outfile)
    f.writerow(["SUBJECT_ID", "YEAR_1", "YEAR_2", "YEAR_3", "SEX"])
    for x in full_data:
        f.writerow(
            [
                x["SUBJECT_ID"],
                x["YEAR_1"],
                x["YEAR_2"],
                x["YEAR_3"],
                x["SEX"] if "SEX" in x else "",
            ]
        )
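A design note: the nested loop above does a full scan of one file for every row of the other. A minimal sketch (same column names as above) that first builds a SUBJECT_ID-to-SEX lookup brings the merge down to a single pass per file:

# build the lookup once...
sex_by_id = {row["SUBJECT_ID"]: row["SEX"] for row in gender_data}

# ...then each row is matched in constant time
for y in full_data:
    y["SEX"] = sex_by_id.get(y["SUBJECT_ID"], "")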
You can do this without importing any modules by reading the csv files as lists of lines, and appending the sex to each line of the main file when the subject ID matches:
with open('test1.csv') as csvfile:
    main_csv = [i.rstrip() for i in csvfile.readlines()]

with open('test2.csv') as csvfile:
    second_csv = [i.rstrip() for i in csvfile.readlines()]

for n, i in enumerate(main_csv):
    if n == 0:
        main_csv[n] = main_csv[n] + ',SEX'
    else:
        patient = i.split(',')[0]
        # compare the first field exactly; startswith would also match
        # longer IDs that merely share a prefix
        hits = [line.split(',')[-1] for line in second_csv if line.split(',')[0] == patient]
        if hits:
            main_csv[n] = main_csv[n] + ',' + hits[0]
        else:
            main_csv[n] = main_csv[n] + ','

with open('test.csv', 'w') as f:
    f.write('\n'.join(main_csv))
I am trying to write a set of data to a csv file. The file has headers, and one header repeats with an auto-incrementing number, once for each value that field holds. For example, if a profile has 17 entries in its Additional Skills field, the headers look like
Additional Skills 1 + Endorsements .... Additional Skills 17 + Endorsements
Now, when I write the data against those fields, everything lines up if a profile has exactly 17 skills. But if another profile has, say, only 10 skills, its 10 values are written and then the remaining columns shift left: since there are other columns after "Additional Skills + Endorsements", for example a "School" column, the school data gets written into "Additional Skills 11 + Endorsements" instead of into the "School" column.
My code for adding the column fields is as follows:
profile_eksills_len = 0
for profile in link_data:
    new_profile_eksills = len(profile["skillsExtended"])
    if new_profile_eksills > profile_eksills_len:
        profile_eksills_len = new_profile_eksills

for i in range(profile_eksills_len):
    profile_header.append("Additional Skills {} + Endorsements".format(i + 1))
Code for writing the CSV file is as follows:
with open("data.csv", "w") as output:
writer = csv.writer(output, dialect='excel', delimiter='\t', )
writer.writerow(profile_header)
# get job title
for profile in link_data:
exp_data = [profile['name'], profile['info'], profile['currentJob'] ]
for exp in profile["exp"]:
if exp['jobDesc']:
exp_data.append(exp['title'] + ":" + exp['comp'] + ":" + exp['jobDesc'])
else:
exp_data.append(exp['title'] + ":" + exp['comp'])
for exp in profile["extras"]:
exp_data.append(exp['extras'])
for edu in profile['edu']:
exp_data.append(edu['school'])
for skills in profile["skills"]:
exp_data.append(skills['sets'] + ":" + skills['endorsementCounts'])
for skill in profile["skillsExtended"]:
exp_data.append(skill['extsets'] + ":" + skill['endorsedCount'])
print(exp_data)
# write data column wise....
writer.writerow(exp_data)
Is there a way to achieve this?
Assuming that you know all your headers in advance, the best approach would be to collect your row data in dictionaries and use csv.DictWriter to write to the file.
DictWriter will handle missing fields automatically. By default it will populate them with an empty string, but you can provide an alternative value via DictWriter's restval parameter.
The outline code would look like this:
fieldnames = ['heading1', 'heading2', ...]

with open("data.csv", "w") as output:
    writer = csv.DictWriter(output, fieldnames, dialect='excel', delimiter='\t')
    writer.writeheader()
    # get job title
    for profile in link_data:
        # Build a dictionary of the fieldnames and values for the row.
        row_data = {'heading1': 'foo', 'heading2': 'bar', ...}
        writer.writerow(row_data)
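Applied to the variable-length skills columns, a sketch might look like this (the profile keys are taken from the question's code; the 'Name' and 'School' headers and the edu[0] access are illustrative assumptions):

row_data = {'Name': profile['name'], 'School': profile['edu'][0]['school']}
for i, skill in enumerate(profile["skillsExtended"], start=1):
    row_data['Additional Skills {} + Endorsements'.format(i)] = (
        skill['extsets'] + ":" + skill['endorsedCount']
    )
writer.writerow(row_data)  # DictWriter fills any skills columns this profile lacks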
import csv

with open(file_name, encoding='utf-8') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    header = next(csv_reader)
    print("header:", header)
    for row in csv_reader:
        data = dict(zip(header, row))
        print('combined dict data:', data)
Read the header first, then zip it with each row to build a dict mapping column names to values.
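For example, with an assumed header and row:

header = ['id', 'name']
row = ['1', 'Alice']
print(dict(zip(header, row)))  # {'id': '1', 'name': 'Alice'}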
I have a master.txt file that I converted to master.csv, using a comma as the delimiter. I'm trying to copy only the columns I want from master.csv to aircraft.csv. My first column contains values such as A1234 and CB34555, and these show up as blanks in aircraft.csv. The column header, "N-NUMBER", does show up, as does all the data in the other columns. How do I fix this to get my complete data?
import arcpy
import csv
import time

time.sleep(7)

arcpy.env.workspace = r"C:\GIS\final"
master = r"C:\GIS\final\MASTER.txt"
table = r"C:\GIS\final\MASTER.csv"
result = r"C:\GIS\final\aircraft.csv"
need = ["N-NUMBER", "NAME", "STREET", "STREET2", "CITY", "STATE", "ZIP CODE", "REGION", "TYPE AIRCRAFT"]

StartTime = time.clock()
in_txt = csv.reader(open(master, "rb"), delimiter=',')
out_csv = csv.writer(open(table, 'wb'))
out_csv.writerows(in_txt)
del in_txt
del out_csv
EndTime = time.clock()
TotalTime = str(EndTime - StartTime)
print "Conversion Operation Complete in " + TotalTime + " seconds."

StartTime = time.clock()
with open(table) as infile, open(result, "wb") as outfile:
    r = csv.DictReader(infile)
    w = csv.DictWriter(outfile, need, extrasaction="ignore")
    w.writeheader()
    for row in r:
        w.writerow(row)
EndTime = time.clock()
TotalTime2 = str(EndTime - StartTime)
print "Cleaning Operation Complete in " + TotalTime2 + " seconds."
I am going to suggest what I think might be a much simpler way to deal with this problem, without using as many Python modules.
EDIT: I added some code to extract the indices of the labels in "need" from the first row, assuming the first row is the header, and then stored those indices in "wanted".
need = ["N-NUMBER", "NAME", "STREET", "STREET2", "CITY", "STATE", "ZIP CODE", "REGION", "TYPE AIRCRAFT"]
with open(infile, 'r') as f:
header = f.readline().split(',') # read and split the header
wanted = [header.index(x) for x in need] # get the indices you want out of the header
rows = f.readlines() # creates list of each row as a string
table = [r.split(',') for r in rows] # splits each row on the ','
with open(outfile, 'w') as o:
o.write(','.join(header) + '\n') # re-join the split header, write it
for row in table:
out_string = ','.join([row[idx] for idx in wanted]) + '\n'
o.write(out_string) # write a new csv with the columns specified in "wanted"
Here you are just opening the file, reading all of the data, and then writing the data you want to a new file, after selecting the columns via the indices in "wanted". This should do the job without much overhead. One caveat: splitting on ',' breaks if any field contains a quoted comma, which is exactly the case the csv module handles for you.
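If quoted commas are a risk in your data, a minimal Python 3 sketch of the same column extraction using the csv module (reusing infile, outfile, and need from above) would be:

import csv

with open(infile, newline='') as f, open(outfile, 'w', newline='') as o:
    reader = csv.reader(f)
    writer = csv.writer(o)
    header = next(reader)
    wanted = [header.index(x) for x in need]
    writer.writerow(need)  # header for the kept columns only
    for row in reader:
        writer.writerow([row[idx] for idx in wanted])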
import csv
import os

filename = 'result'
column = 'Latitude'

os.system("wget http://earthquake.usgs.gov/earthquakes/feed/csv/1.0/hour")

#csv_data = csv.reader(downloaded_data)
file = csv.reader(open('/home/coperthought/Documents/hour', 'rb'), delimiter='\t')  # (unused)

data = []  # This will contain our data

# Create a csv reader object to iterate through the file
reader = csv.reader(open('/home/coperthought/Documents/hour', 'rU'), delimiter=',', dialect='excel')
hrow = reader.next()  # Get the top row
idx = hrow.index(column)  # Find the column of the data you're looking for
for row in reader:  # Iterate the remaining rows
    data.append(row[idx])

os.remove('/home/coperthought/Documents/hour')
print data
then data is
['63.190', '63.730', '59.935', '38.805', '61.416', '63.213']
How can I get this into a string? join is one option...
Thanks
"How can I get this into a string? join is one option..."
Just use 'whateveryouwanthere'.join(data).
You already mentioned the join method; if you do not want a solution involving join, you need to explain what the problem with it is.
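For example, joining the list shown above with a comma and a space:

data = ['63.190', '63.730', '59.935', '38.805', '61.416', '63.213']
print(', '.join(data))  # 63.190, 63.730, 59.935, 38.805, 61.416, 63.213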