Python: How to properly map the data against the column in CSV

I am trying to write a set of data to a CSV file. The file has headers, and some header names are numbered automatically according to the number of values in that field. For example, if I have an Additional Skills column and there are 17 skills, the headers will look like:
Additional Skills 1 + Endorsements .... Additional Skills 17 + Endorsements
When I write the data, it lines up correctly if a profile has exactly 17 skills. But if another profile has, let's say, 10 skills, only 10 fields are filled. Since there are other columns after the "Additional Skills + Endorsements" columns, for example a "School" column, the school data is written into "Additional Skills 11 + Endorsements" instead of the "School" column.
My code for adding the column headers is as follows:
profile_eksills_len = 0
for profile in link_data:
    new_profile_eksills = len(profile["skillsExtended"])
    if new_profile_eksills > profile_eksills_len:
        profile_eksills_len = new_profile_eksills
for i in range(profile_eksills_len):
    profile_header.append("Additional Skills {} + Endorsements".format(i+1))
Code for writing the CSV file is as follows:
with open("data.csv", "w") as output:
    writer = csv.writer(output, dialect='excel', delimiter='\t', )
    writer.writerow(profile_header)
    # get job title
    for profile in link_data:
        exp_data = [profile['name'], profile['info'], profile['currentJob'] ]
        for exp in profile["exp"]:
            if exp['jobDesc']:
                exp_data.append(exp['title'] + ":" + exp['comp'] + ":" + exp['jobDesc'])
            else:
                exp_data.append(exp['title'] + ":" + exp['comp'])
        for exp in profile["extras"]:
            exp_data.append(exp['extras'])
        for edu in profile['edu']:
            exp_data.append(edu['school'])
        for skills in profile["skills"]:
            exp_data.append(skills['sets'] + ":" + skills['endorsementCounts'])
        for skill in profile["skillsExtended"]:
            exp_data.append(skill['extsets'] + ":" + skill['endorsedCount'])
        print(exp_data)
        # write data column wise....
        writer.writerow(exp_data)
I would like to know if there is a way to achieve this?

Assuming that you know all your headers in advance, the best approach would be to collect your row data in dictionaries and use csv.DictWriter to write to the file.
DictWriter handles missing fields automatically. By default it fills them with an empty string, but you can provide an alternative value via DictWriter's restval parameter.
The outline code would look like this:
import csv

fieldnames = ['heading1', 'heading2', ...]
# newline='' lets the csv module control line endings itself
with open("data.csv", "w", newline='') as output:
    writer = csv.DictWriter(output, fieldnames, dialect='excel', delimiter='\t')
    writer.writeheader()
    # get job title
    for profile in link_data:
        # Build a dictionary of the fieldnames and values for the row.
        row_data = {'heading1': 'foo', 'heading2': 'bar',...}
        writer.writerow(row_data)
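To make the outline concrete, here is a minimal sketch of how DictWriter keeps a trailing column aligned when profiles have different numbers of skills. The sample profiles, the simplified "Name"/"School" columns, and the helper names build_fieldnames, profile_to_row and write_profiles are all made up for illustration; the real data in the question has more fields.

```python
import csv
import io

# Hypothetical sample data: each profile has a different number of skills.
profiles = [
    {"name": "Ann", "school": "MIT", "skills": ["Python", "SQL", "Go"]},
    {"name": "Bob", "school": "ETH", "skills": ["Excel"]},
]


def build_fieldnames(profiles):
    """Return the header list with one numbered skill column per slot."""
    max_skills = max((len(p["skills"]) for p in profiles), default=0)
    skill_cols = ["Additional Skills {} + Endorsements".format(i + 1)
                  for i in range(max_skills)]
    return ["Name"] + skill_cols + ["School"]


def profile_to_row(profile):
    """Map one profile to a dict keyed by header name."""
    row = {"Name": profile["name"], "School": profile["school"]}
    for i, skill in enumerate(profile["skills"], start=1):
        row["Additional Skills {} + Endorsements".format(i)] = skill
    return row


def write_profiles(profiles, output):
    """Write all profiles; columns missing from a row are filled with restval."""
    writer = csv.DictWriter(output, fieldnames=build_fieldnames(profiles),
                            delimiter="\t", restval="")
    writer.writeheader()
    for profile in profiles:
        writer.writerow(profile_to_row(profile))


buffer = io.StringIO()  # stands in for the real output file
write_profiles(profiles, buffer)
lines = buffer.getvalue().splitlines()
```

Because every value is keyed by its header name, Bob's school still lands in the "School" column even though two of his skill columns stay empty.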

with open(file_name, encoding='utf-8') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    header = next(csv_reader)
    print("header:", header)
    for row in csv_reader:
        data = dict(zip(header,row))
        print('combined dict data:', data)
Read the header first, then zip it with each row to get a dict that maps every column name to that row's value.
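For comparison, csv.DictReader performs the same header-to-value pairing on its own. A small sketch, using a made-up two-column sample held in memory instead of a real file:

```python
import csv
import io

# Hypothetical sample; any CSV with a header row behaves the same way.
sample = "name,school\nAnn,MIT\nBob,ETH\n"

# Manual approach: read the header, then zip it with each row.
reader = csv.reader(io.StringIO(sample))
header = next(reader)
zipped_rows = [dict(zip(header, row)) for row in reader]

# csv.DictReader uses the first row as keys automatically.
dict_rows = [dict(row) for row in csv.DictReader(io.StringIO(sample))]
```

Both lists hold one dict per data row, so either form can feed a DictWriter.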

Related

Efficient and Fast merger for multiple .CSV files

I have more than 15M tweets and I need to merge the ID and text columns after dropping duplicates. I need the most efficient way to do this, as it currently takes very long to complete.
frames = []
missed = 0
for q in query_list:
    hashtag = q + '.csv'
    try:
        file_data = pd.read_csv(path + hashtag ,encoding='utf-8')
        frames.append(file_data)
    except:
        missed+= 1
        continue
df = pd.concat(frames)
df = df[['id','text']]
df = df.drop_duplicates()
df.to_csv('row_tweets.csv',index=False)
If you want unique pairs of (id, text), I'd just do it in pure Python, using a set for easy de-duplication and csv readers/writers:
import csv
id_text_pairs = set() # set of (id, text) pairs
missed = 0
for q in query_list:
    hashtag = q + '.csv'
    try:
        with open(path + hashtag, 'r', newline='', encoding='utf-8') as infile:
            reader = csv.DictReader(infile)
            for row in reader:
                id_text_pairs.add( (row['id'], row['text']) ) # this won't add duplicates
    except Exception: # skip files that cannot be read or parsed
        missed += 1
        continue
with open('row_tweets.csv', 'w', newline='', encoding='utf-8') as outfile:
    col_names = ['id', 'text']
    writer = csv.DictWriter(outfile, fieldnames=col_names)
    writer.writeheader() # First line is the 'id,text' header
    for tweet_id, text in id_text_pairs:
        writer.writerow({'id': tweet_id, 'text': text}) # write each id,text pair
That should do it, and I believe it will de-duplicate more efficiently than one huge DataFrame call at the end. Note that the csv writer quotes any field containing a comma by default, so commas in the text are safe. If you would rather avoid quoting, you can write tab-delimited output with the DictWriter argument delimiter='\t', or adjust the quotechar and quoting arguments described in the csv module documentation.
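A variation on the same idea, as a rough sketch: write each pair the first time it is seen, so the output keeps first-seen order and no second pass over the set is needed. The function name merge_unique and the in-memory sample files are hypothetical; real code would pass open file objects instead.

```python
import csv
import io


def merge_unique(sources, output):
    """Copy unique (id, text) pairs from CSV sources to output, in first-seen order."""
    seen = set()
    writer = csv.writer(output)
    writer.writerow(["id", "text"])
    for source in sources:
        for row in csv.DictReader(source):
            pair = (row["id"], row["text"])
            if pair not in seen:  # set membership is a constant-time check on average
                seen.add(pair)
                writer.writerow(pair)
    return len(seen)


# Hypothetical in-memory files standing in for the per-hashtag CSVs.
file_a = io.StringIO('id,text,lang\n1,hello,en\n2,"hi, there",en\n')
file_b = io.StringIO('id,text,lang\n2,"hi, there",en\n3,bye,en\n')
merged = io.StringIO()
unique_count = merge_unique([file_a, file_b], merged)
```

The set still has to hold every unique pair in memory, so this only helps with ordering and with avoiding the final loop, not with peak memory use.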

How to filter columns within a .CSV file and then save those filtered columns to a new .CSV file in Python?

I am analyzing a large weather data file, Data.csv. I need to write a program in Python that will filter the Data.csv file and keep the following columns only: STATION, NAME/LOCATION, DATE, AWND, SNOW. Then save the filtered file and name it filteredData.csv.
I am using Python 3.8. I have only been able to somewhat figure out how to filter the columns I need within a print function. How do I filter this file and then save the filtered file?
import csv
filename = 'Data.csv'
f = open(filename, 'rt')
reader = csv.reader(f,delimiter=',')
for column in reader:
    print(column[0] + "," + column[1] + "," + column[2] + "," + column[3] + "," + column[4] + "," + column[13])
[Image: a small section of the Data.csv file]
This can be done quickly with pandas:
import pandas as pd
weather_data = pd.read_csv('Data.csv')
filtered_weather = weather_data[['Column_1','Column_2']] # Replace with the column names that you want
filtered_weather.to_csv('filteredData.csv',index=False)
If you're running this under Windows, you can simply run the code you already wrote with "> newfile.csv" at the end of the command to redirect the output into a text file.
If you want to do it within the code though:
import csv
new_filename = 'Reduced_Data.csv'
filename = 'Data.csv'
# Open both files once instead of reopening the output for every row
with open(filename, 'rt', newline='') as f, open(new_filename, 'w', newline='') as output:
    reader = csv.reader(f,delimiter=',')
    for row in reader:
        output.write('"{}","{}","{}","{}","{}","{}"\n'.format(row[0],row[1],row[2],row[3],row[4],row[13]))
Check out the csv reader. You can do something like:
import csv
content = []
with open('Data.csv', 'r', newline='') as file:
    reader = csv.reader(file, delimiter = ',')
    for row in reader:
        content.append(row)
print(content)
## now writing them in a file:
with open('filteredData.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['STATION', 'NAME LOCATION', 'DATE', 'AWND', 'SNOW'])
    for i in range(1, len(content)):
        ## I left out some columns, so they will not be in the file later; maybe I did not get that right.
        writer.writerow([content[i][0], content[i][1], content[i][2], content[i][3], content[i][13]])
Honestly, I would use this approach, although it is mostly copy & paste.
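If selecting columns by name is preferable to counting indexes, a standard-library sketch could look like the following. The function name filter_columns, the sample rows and the extra TMAX column are invented for illustration; the real Data.csv has different columns and values.

```python
import csv
import io


def filter_columns(source, output, keep):
    """Copy only the columns named in keep from source to output."""
    reader = csv.DictReader(source)
    # extrasaction='ignore' makes DictWriter drop keys that are not in keep.
    writer = csv.DictWriter(output, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Hypothetical sample with made-up values; TMAX is an extra column to drop.
sample = io.StringIO(
    "STATION,NAME,DATE,TMAX,AWND,SNOW\n"
    "US1,Springfield,2020-01-01,40,5.1,0.0\n"
    "US1,Springfield,2020-01-02,38,7.3,1.2\n"
)
filtered = io.StringIO()
filter_columns(sample, filtered, ["STATION", "NAME", "DATE", "AWND", "SNOW"])
```

With real files, the two StringIO objects would be replaced by open('Data.csv', newline='') and open('filteredData.csv', 'w', newline='').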

Rearranging data - row into multiple columns

So I have a CSV file with over 1m records: (https://i.imgur.com/rhIhy5u.png)
I need the data arranged differently, so that the repeating "params" become columns/rows themselves, for example category1, category2, category3 (there are over 20 categories, with no repeats), while all the data keep their relations.
I tried using "pandas" and "csv" in Python, but I am completely new to it and have never worked with data like this.
import csv
with open('./data.csv', 'r') as _filehandler:
    csv_file_reader = csv.reader(_filehandler)
    param = [];
    csv_file_reader = csv.DictReader(_filehandler)
    for row in csv_file_reader:
        if not row['Param'] in param:
            param.append(row['Param']);
    col = "";
    for p in param:
        col += str(p) + '; ';
    print(col);
import numpy as np
np.savetxt('./SortedWexdord.csv', (parameters), delimiter=';', fmt='%s')
I tried to think about it, but data is not my forte. Any ideas?
Here's something that should work. If you need more than one value per row normalized like this, you could edit the line beginning with "category, value =" to grab a list of values instead of just row[1].
import csv
data = {}
with open('data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    next(reader) # Skip header row
    for row in reader:
        category, value = row[0], row[1] # Assumes category is in column 0 and target value is in column 1
        if category in data:
            data[category].append(value)
        else:
            data[category] = [value] # New entry only for each unique category
# In Python 3, use text mode with newline='' (instead of 'wb') to avoid double newlines on Windows
with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Category', 'Value'])
    for category in data:
        print([category] + data[category])
        writer.writerow([category] + data[category]) # Make a list starting with category and then listing each value
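The grouping step can also be written with collections.defaultdict, which removes the if/else branch. This is only a sketch on a made-up three-row sample; the function name group_by_first_column is hypothetical, and it assumes every data row has at least two columns.

```python
import csv
import io
from collections import defaultdict


def group_by_first_column(source):
    """Return {category: [values, ...]} from a CSV whose first row is a header."""
    reader = csv.reader(source)
    next(reader)  # skip the header row
    grouped = defaultdict(list)
    for category, value, *_ in reader:  # ignore any columns after the second
        grouped[category].append(value)
    return grouped


# Hypothetical sample: category in column 0, value in column 1.
sample = io.StringIO("Param,Value\ncategory1,a\ncategory2,b\ncategory1,c\n")
grouped = group_by_first_column(sample)

output = io.StringIO()  # stands in for the real output file
writer = csv.writer(output)
for category, values in grouped.items():
    writer.writerow([category] + values)  # one row per category
```

Each category ends up on a single row followed by all of its values, in the order they appeared in the input.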

Read only defined columns of CSV

I wrote a python program that joins 2 csv tables according to a matching key.
My data looks like this:
Table 1:
ID;NAME;ADRESS;TEL
1; Lee; Str.; 12345
2; Chu; Blv.; 34567
Table 2:
AID; FID; XID
50 1 99
676 2 678
My code looks like this:
data = OrderedDict()
fieldnames = []
with open(join_file, "rt") as fp:
    reader = csv.DictReader(fp, dialect=excel_semicolon)
    fieldsB = reader.fieldnames
    fieldnames.extend(fieldsB)
    for row in reader:
        data.setdefault(row["FID"], {}).update(row)
with open(fileA, "rt") as fp:
    reader = csv.DictReader(fp, dialect=excel_semicolon)
    fieldnames.extend(reader.fieldnames)
    for row in reader:
        data.setdefault(row["ID"], {}).update(row)
fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged2.csv", "wt", newline='') as fp:
    writer = csv.writer(fp, dialect=excel_semicolon)
    writer.writerow(fieldnames)
    for row in data.values():
        writer.writerow([row.get(field, '') for field in fieldnames],)
The join operation works like this, but my problem is that I want to remove certain fields from table 2 from the joined csv (e.g. XID). Is there a simple way to do this?
My solution prior to this was using Pandas but the script should run on a server where I don't want to (can't) install the dependencies for the import.
If you wish to take something out, you can add a simple filter using a list comprehension.
You create the list here:
fieldnames = list(OrderedDict.fromkeys(fieldnames))
Filter out what you do not want:
filtered_fieldnames = [x for x in fieldnames if x != 'XID']
Then write the new file using the filtered list:
with open("merged2.csv", "wt", newline='') as fp:
    writer = csv.writer(fp, dialect=excel_semicolon) # same dialect as in the question
    writer.writerow(filtered_fieldnames)
    for row in data.values():
        writer.writerow([row.get(field, '') for field in filtered_fieldnames],)
You can wrap it in a function and call it whenever you create a new file or wish to take something out:
def create_merged_file(names):
    with open("merged2.csv", "wt", newline='') as fp:
        writer = csv.writer(fp, dialect=excel_semicolon) # same dialect as in the question
        writer.writerow(names)
        for row in data.values():
            writer.writerow([row.get(field, '') for field in names],)
create_merged_file(fieldnames)
filtered_fieldnames = [x for x in fieldnames if x != 'XID']
create_merged_file(filtered_fieldnames)
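If more than one field has to go, the same comprehension works with a set of excluded names. A small sketch: drop_fields is a hypothetical helper, the field names come from the two tables in the question, and the choice to drop AID as well as XID is only an example.

```python
def drop_fields(fieldnames, excluded):
    """Return fieldnames without the excluded names, preserving order."""
    excluded = set(excluded)
    return [name for name in fieldnames if name not in excluded]


# Field names taken from the two tables in the question.
fieldnames = ["AID", "FID", "XID", "ID", "NAME", "ADRESS", "TEL"]
filtered_fieldnames = drop_fields(fieldnames, {"XID", "AID"})

# One merged row, shaped like the dicts stored in data.values().
row = {"AID": "50", "FID": "1", "XID": "99", "ID": "1",
       "NAME": "Lee", "ADRESS": "Str.", "TEL": "12345"}
filtered_row = [row.get(field, "") for field in filtered_fieldnames]
```

Since each output row is built from the filtered names, the excluded values never reach the writer even though they are still present in the merged dicts.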

Python to insert quotes to column in CSV

I have no knowledge of Python.
What I want to be able to do is create a script that edits a CSV file so that every field in column 3 is wrapped in quotes. I haven't been able to find much help. Is this quick and easy to do? Thanks.
column1,column2,column3
1111111,2222222,333333
This is a fairly crude solution, very specific to your request (assuming your source file is called "csvfile.csv" and is in C:\Temp).
import csv
newrow = []
# Python 3: open in text mode with newline='' (the 'rb'/'wb' modes only work with csv in Python 2)
csvFileRead = open('c:/temp/csvfile.csv', 'r', newline='')
csvFileNew = open('c:/temp/csvfilenew.csv', 'w', newline='')
# Open the CSV
csvReader = csv.reader(csvFileRead, delimiter = ',')
# Append the rows to variable newrow
for row in csvReader:
    newrow.append(row)
# Add quotes around the third list item
for row in newrow:
    row[2] = "'"+str(row[2])+"'"
csvFileRead.close()
# Create a new CSV file
csvWriter = csv.writer(csvFileNew, delimiter = ',')
# Append the csv with rows from newrow variable
for row in newrow:
    csvWriter.writerow(row)
csvFileNew.close()
There are MUCH more elegant ways of doing what you want, but I've tried to break it down into basic chunks to show how each bit works.
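I would start by looking at the csv module.
import csv
filename = 'file.csv'
rows = []
# Open for reading; a write mode such as 'wb' would truncate the file before it is read
with open(filename, 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        row[2] = "'%s'" % row[2]
        rows.append(row) # keep the modified row so it can be written back
And then write the rows back to the CSV file.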
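If double quotes are wanted instead of single quotes, adding them by hand does not work well, because csv.writer escapes a hand-added quote character by doubling it and wrapping the whole field in quotes. One possible alternative, sketched here under the assumption that columns 1 and 2 always hold integers, is csv.QUOTE_NONNUMERIC, which quotes only string fields. The function name quote_third_column is made up, and the header is written by hand so that it stays unquoted (this assumes the header names contain no commas).

```python
import csv
import io


def quote_third_column(source, output):
    """Rewrite a 3-column CSV so that only column 3 is wrapped in double quotes."""
    reader = csv.reader(source)
    # QUOTE_NONNUMERIC quotes every str field and leaves int/float fields bare.
    writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)
    header = next(reader)
    output.write(",".join(header) + "\r\n")  # header written by hand, unquoted
    for row in reader:
        # Assumption: columns 1 and 2 are integers; column 3 stays a string.
        writer.writerow([int(row[0]), int(row[1]), row[2]])


# The sample rows from the question, held in memory instead of a file.
source = io.StringIO("column1,column2,column3\n1111111,2222222,333333\n")
result = io.StringIO()
quote_third_column(source, result)
```

If the first two columns can contain non-numeric text, this approach would raise a ValueError on int(), so it only fits data shaped like the example.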
