CSV manipulation in Python

I want to be able to modify CSV data the way we can modify JSON in JavaScript, with plain code and object manipulation, like this:
var obj = JSON.parse(jsonStr);
obj.name = 'foo bar';
var modifiedJSON = JSON.stringify(obj)
How can I do the same thing for CSV files in Python?
Something like:
csvObject = parseCSV(csvStr)
csvObject.age = 10
csvObject.name = csvObject.firstName + csvObject.lastName
csvStr = toCSV(csvObject)
I have a CSV file, customers.csv. Its columns are ID, Name (split into LastName and FirstName), Item and Date. An example of the file:
ID,LastName,FirstName,Item,Date
11231249015,Derik,Smith,Televisionx1,1391212800000
24156246254,Doe,John,FooBar,1438732800000
I know that the Python csv library can read it, but can the file be treated as an object as a whole and then manipulated?
I basically want to combine the first name and last name and perform some math on the IDs, but in the way JavaScript handles JSON.

I'm not sure, but maybe you want to use https://github.com/samarjeet27/CSV-Mapper
Install it with pip install csvmapper
import csvmapper
# create parser instance
parser = csvmapper.CSVParser('customers.csv', hasHeader=True)
# create object
customers = parser.buildDict() # buildObject() if you want object
# perform manipulation
for customer in customers:
    customer['Name'] = customer['FirstName'] + ' ' + customer['LastName']
    # remove the last name and first name
    # (maybe this is what you wanted?)
    customer.pop('LastName', None)
    customer.pop('FirstName', None)
print(customers)
Output
[{'Name': 'Smith Derik', 'Item': 'Televisionx1', 'Date': '1391212800000', 'ID': '11231249015'}, {'Name': 'John Doe', 'Item': 'FooBar', 'Date': '1438732800000', 'ID': '24156246254'}]
This combines FirstName and LastName by accessing each row as a dict. I assumed you want to remove the separate last name and first name and replace them with a single 'Name' property. Use parser.buildObject() instead if you want attribute-style access, as in JavaScript.
Edit
You can save it back to CSV too.
writer = csvmapper.CSVWriter(customers) # modified customers from the above code
writer.write('customers-final.csv')
As for being able to perform math, you could use a custom mapper that specifies the type(s) of the columns:
mapper = csvmapper.DictMapper(x = [
    [
        { 'name':'ID' ,'type':'long'},
        { 'name':'LastName' },
        { 'name':'FirstName' },
        { 'name':'Item' },
        { 'name':'Date', 'type':'int' }
    ]
])
parser = csvmapper.CSVParser('customers.csv', mapper)

JSON can, by design, represent various kinds of data in various arrangements (objects, arrays...), and you can nest these if you wish. This means that it's relatively easy to serialise and deserialise complex objects.
On the other hand, CSV is just rows and columns of data: no structured objects, arrays or nesting. So you basically have to know ahead of time what you're dealing with, and then manually map the columns to corresponding objects.
That said, Python's csv module does have a DictReader class, which lets you read a CSV file as a sequence of dictionaries, one per row. It automatically maps the first (header) row to field names, but you can also pass the field names to the constructor. You can therefore reference a value in a row by its column header / field name. There is also a corresponding DictWriter class. If you don't need any fancy nesting or complex data structures, these may be all you really need.
This example is taken directly from the Python module documentation:
import csv
with open('names.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
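Applied to the customers.csv example from the question, a minimal sketch using only the standard library might look like the following. It assumes the columns shown in the question; the helper names parse_csv and to_csv are hypothetical, chosen to mirror the question's pseudocode, and the ID arithmetic is just a placeholder for "some math".

```python
import csv
import io

CSV_TEXT = """ID,LastName,FirstName,Item,Date
11231249015,Derik,Smith,Televisionx1,1391212800000
24156246254,Doe,John,FooBar,1438732800000
"""

def parse_csv(text):
    """Parse CSV text into a list of dicts, one per row (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text)))

def to_csv(rows):
    """Serialise a list of dicts back to CSV text (hypothetical helper)."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

customers = parse_csv(CSV_TEXT)
for customer in customers:
    # Combine the two name columns into one and drop the originals.
    customer["Name"] = customer.pop("FirstName") + " " + customer.pop("LastName")
    # CSV values are always strings, so convert before doing math.
    customer["ID"] = int(customer["ID"]) + 1

csv_str = to_csv(customers)
print(csv_str)
```

Each row is an ordinary dict, so the manipulation reads much like the JavaScript version; the main difference is that types have to be converted by hand.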

Related

How to access nested attribute without passing parent attribute in pyspark json

I am trying to access inner attributes of following json using pyspark
[
  {
    "432": [
      {
        "atttr1": null,
        "atttr2": "7DG6",
        "id":432,
        "score": 100
      }
    ]
  },
  {
    "238": [
      {
        "atttr1": null,
        "atttr2": "7SS8",
        "id":432,
        "score": 100
      }
    ]
  }
]
In the output, I am looking for something like below in form of csv
atttr1, atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details as shown below, but I don't want to hard-code 432 or 238 in the lambda expression, because in a bigger JSON these keys will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x:(x['432'])).first())
print(inputDF.rdd.map(lambda x:(x['238'])).first())
I also tried registering a temp table with the name "test" but it gave an error with message element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using spark 2.4
Without using pyspark features, you can do it like this:
import json
data = json.loads(json_str) # or whatever way you're getting the data
columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns)) # headers
for item in data:
    for obj in list(item.values())[0]: # each item has a single key, so take its list
        print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or
for item in data:
    obj = list(item.values())[0][0] # since the object is the one and only item in list
    print(','.join(str(obj[col]) for col in columns))
FYI, you can store those lines in a variable or write them out to a CSV file instead of, or in addition to, printing them.
And if you're just looking to dump that to csv, see this answer.
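As a rough sketch of the "write it out to csv" option, the same traversal can feed csv.writer instead of print. The sample data is copied from the question; the rows_from helper and the in-memory buffer are assumptions made for illustration. Note that csv.writer writes None as an empty field rather than the text "None".

```python
import csv
import io
import json

json_str = """
[
  {"432": [{"atttr1": null, "atttr2": "7DG6", "id": 432, "score": 100}]},
  {"238": [{"atttr1": null, "atttr2": "7SS8", "id": 432, "score": 100}]}
]
"""

columns = ["atttr1", "atttr2", "id", "score"]

def rows_from(data):
    """Yield one flat row per inner object, whatever the outer key is."""
    for item in data:
        for objects in item.values():  # the outer key ("432", "238", ...) is ignored
            for obj in objects:
                yield [obj.get(col) for col in columns]

# An in-memory buffer keeps the sketch self-contained;
# use open("out.csv", "w", newline="") to write a real file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(columns)
writer.writerows(rows_from(json.loads(json_str)))
csv_text = buffer.getvalue()
print(csv_text)
```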

What's the best way to convert string array into a table?

Given an array parsed from a CSV as follows (don't worry about the parsing part, just consider this array as the starting point):
['name,age,city', 'tom,12,new york','john, 10, los angeles']
The first element holds the column names. What's the best way to convert this into a table? I was thinking of using numpy and pandas to create a dataframe, but what would be the most memory- and time-efficient way to do this? I am then planning to do some data analysis and create some new features. Is there something in the standard Python library I can use, or is pandas the best way to go? If I were to use just built-in functions, how would I go about it? At the end I would need to combine the features back into the original array form.
Builtins only (aside from pprint for printing):
import pprint
data = [
    "name,age,city",
    "tom,12,new york",
    "john, 10, los angeles",
]
cols = None
out_data = []
for line in data:
    line = line.split(",")
    # We don't know the columns yet; must be the first line
    if not cols:
        cols = line
        continue
    out_data.append(dict(zip(cols, line)))
pprint.pprint(out_data)
Using the csv standard module:
import csv
import io
import pprint
data = [
    "name,age,city",
    "tom,12,new york",
    "john, 10, los angeles",
]
reader = csv.DictReader(io.StringIO('\n'.join(data)))
out_data = list(reader)
pprint.pprint(out_data)
Both approaches output the expected:
[{'age': '12', 'city': 'new york', 'name': 'tom'},
{'age': ' 10', 'city': ' los angeles', 'name': 'john'}]
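Staying with the standard library, here is a sketch of the full round trip the question asks about: parse, add a feature, and turn the result back into a list of comma-joined strings. The is_teen column is a made-up example feature, and skipinitialspace=True is used to drop the stray blanks visible in the output above.

```python
import csv
import io

data = ["name,age,city", "tom,12,new york", "john, 10, los angeles"]

# skipinitialspace=True drops the blank after each comma (" 10" -> "10").
reader = csv.DictReader(io.StringIO("\n".join(data)), skipinitialspace=True)
rows = list(reader)

for row in rows:
    row["age"] = int(row["age"])       # values arrive as strings
    row["is_teen"] = row["age"] >= 13  # hypothetical derived feature

# Serialise back to the original "list of comma-joined strings" form.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
result = out.getvalue().splitlines()
print(result)
```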
Pandas is the way to go. You do not need to parse the values yourself. Instead, use read_csv to create a dataframe from your CSV file and do feature generation/extraction or data cleaning on that frame. The Python standard library does not offer this kind of capability out of the box.
To get your values back as a Python list at the end, use df.values.tolist().
pandas implements its critical sections in C, which typically makes it much faster than equivalent pure-Python loops on large data.
I can't speak to efficiency, but as far as an easy way to convert it to a table goes, pandas would be the best option. I would use pandas.read_csv for it.

Finding and replacing values in specific columns in a CSV file using dictionaries

My goal here is to clean up address data from individual CSV files using a dictionary for each individual column, sort of like automating the find-and-replace feature from Excel. The addresses are divided into columns: house numbers, street names, directions and street types each have their own column. I used the following code to process the whole document.
missad = {
    'Typo goes here': 'Corrected typo goes here'}
def replace_all(text, dic):
    # apply every typo -> correction pair to the whole text
    for i, j in dic.items():
        text = text.replace(i, j)
    return text
with open('original.csv','r') as csvfile:
    text=csvfile.read()
    text=replace_all(text,missad)
with open('cleanfile.csv','w') as cleancsv:
    cleancsv.write(text)
While the code works, I need separate dictionaries, because some columns need specific typo fixes. For example, housenum for the house numbers column, stdir for the street direction, and so on, each with its column-specific typos:
housenum = {
    'One': '1',
    'Two': '2'
}
stdir = {
    'NULL': ''}
I have no idea how to proceed. I feel it's something simple, or that I would need pandas, but I am unsure how to continue. I would appreciate any help! Also, is there any way to group several typos together with one corrected value? I tried the following but got an unhashable type error.
missad = {
    ['Typo goes here','Typo 2 goes here','Typo 3 goes here']: 'Corrected typo goes here'}
Is something like this what you are looking for?
import pandas as pd
df = pd.read_csv(filename, index_col=False) # use pandas to read in the CSV file
# let's say you want to apply corrections to the 'columnforcorrection' column of this dataframe
correctiondict= {
    'one': 1,
    'two': 2
}
df['columnforcorrection']=df['columnforcorrection'].replace(correctiondict)
Then use the same idea for the other columns of interest.
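For comparison, here is a minimal standard-library sketch of per-column corrections, which also covers the grouping question. A list cannot be a dict key because it is not hashable, so the sketch writes the grouping as {correction: [typos]} and inverts it into a flat lookup. The sample data, the street column and the group helper are all hypothetical. Unlike the question's str.replace approach, this replaces a cell only when the whole cell matches.

```python
import csv
import io

# Hypothetical sample data and column names, for illustration only.
original = """housenum,stdir,street
One,NULL,Mian
Two,N,Oak
12,NULL,Elm
"""

def group(corrections):
    """Expand {correct: [typo, ...]} into a flat {typo: correct} lookup."""
    return {typo: fixed for fixed, typos in corrections.items() for typo in typos}

# One lookup table per column name.
fixes = {
    "housenum": {"One": "1", "Two": "2"},
    "stdir": {"NULL": ""},
    "street": group({"Main": ["Mian", "Mainn"]}),
}

reader = csv.DictReader(io.StringIO(original))
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
for row in reader:
    for column, table in fixes.items():
        # Replace the whole cell only on an exact match; otherwise keep it.
        row[column] = table.get(row[column], row[column])
    writer.writerow(row)

cleaned = out.getvalue()
print(cleaned)
```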

Merge Multiple CSV with different column name but same definition

I have several sources (CSV files) of similar data sets that I want to merge into a single data set and write to my DB. Since the data comes from different sources, the files use different headers, and I want to merge the columns that have the same logical meaning.
So far, I have tried reading all the headers first, then re-reading the files to get all the data into a single data frame, and then using if/else to merge the columns with the same meaning. Ideally I would like to create a mapping file with all possible column names per column and then read each CSV using that mapping. The data is not ordered or sorted between files. The number of columns might differ too, but all files have the columns I am interested in.
Sample data:
File 1:
id, name, total_amount...
1, "test", 123 ..
File 2:
member_id, tot_amnt, name
2, "test2", 1234 ..
I want the result to look like:
id, name, total_amount...
1, "test", 123...
2, "test2", 1234...
...
I can't think of an elegant way to do this; it would be great to get some direction or help.
Thanks
Use skiprows and header=None to skip the header, names to specify your own list of column names, and concat to merge into a single df, i.e.
import pandas as pd
pd.concat([
    pd.read_csv('file1.csv',skiprows=1,header=None,names=['a','b','c']),
    pd.read_csv('file2.csv',skiprows=1,header=None,names=['a','b','c'])]
)
Edit: If the files differ only by column order, you can pass a different column order to names, and if you want to select a subset of columns, use usecols. But you need to work out this mapping in advance, either by probing the file or by some other rule.
This requires mapping files to handlers somehow, e.g.
file1.csv
id, name, total_amount
1, "test", 123
file2.csv
member_id, tot_amnt, ignore, name
2, 1234, -1, "test2"
The following selects the three common columns and renames/reorders them.
import pandas as pd
pd.concat([
    pd.read_csv('file1.csv',skiprows=1,header=None,names=['id','name','value'],usecols=[0,1,2]),
    pd.read_csv('file2.csv',skiprows=1,header=None,names=['id','value','name'],usecols=[0,1,3])],
    sort=False
)
Edit 2:
A nice way to apply this is to use lambdas and a dict that maps each file to its schema, i.e.
parsers = {
    "schema1": lambda f: pd.read_csv(f,skiprows=1,header=None,names=['id','name','value'],usecols=[0,1,2]),
    "schema2": lambda f: pd.read_csv(f,skiprows=1,header=None,names=['id','value','name'],usecols=[0,1,3])
}
# file name -> schema name (named file_schemas so it does not shadow the built-in map)
file_schemas = {
    "file2.csv": "schema2",
    "file1.csv": "schema1"}
pd.concat([parsers[v](k) for k,v in file_schemas.items()], sort=False)
This is what I ended up doing and found to be the cleanest solution. Thanks, David, for your help.
dict1= {'member_number': 'id', 'full name': 'name', …}
dict2= {'member_id': 'id', 'name': 'name', …}
parsers = {
    "schema1": lambda f, columns: pd.read_csv(f,index_col=False,usecols=list(columns.keys())),
    "schema2": lambda f, columns: pd.read_csv(f,index_col=False,usecols=list(columns.keys()))
}
# schema name -> (file name, column rename mapping)
sources = {
    'schema1': ('a_file.csv',dict1),
    'schema2': ('b_file.csv',dict2)
}
total = []
for k,v in sources.items():
    d = parsers[k](v[0], v[1])
    d.rename(columns=v[1], inplace=True)
    total.append(d)
final_df = pd.concat(total, sort=False)
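The "mapping file with all possible column names per column" idea from the question can also be sketched without pandas, using csv.DictReader and an alias table. This is a swapped-in standard-library variant, not the approach used in the answers above; the ALIASES table and the read_normalised helper are hypothetical, and the inline sample files follow the file1.csv / file2.csv examples.

```python
import csv
import io

# Hypothetical alias table: canonical column name -> header spellings seen in the sources.
ALIASES = {
    "id": {"id", "member_id"},
    "name": {"name"},
    "total_amount": {"total_amount", "tot_amnt"},
}
# Invert it once so each raw header maps straight to its canonical name.
CANONICAL = {alias: name for name, aliases in ALIASES.items() for alias in aliases}

def read_normalised(handle):
    """Yield rows keyed by canonical names; unknown columns are dropped."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    for row in reader:
        yield {CANONICAL[k]: v for k, v in row.items() if k in CANONICAL}

# In-memory stand-ins for file1.csv and file2.csv.
file1 = io.StringIO('id, name, total_amount\n1, "test", 123\n')
file2 = io.StringIO('member_id, tot_amnt, ignore, name\n2, 1234, -1, "test2"\n')

merged = [row for f in (file1, file2) for row in read_normalised(f)]
print(merged)
```

Because the lookup is by header name rather than position, column order and extra columns in a source file do not matter. In pandas, the same alias table could be passed to DataFrame.rename(columns=...) before concatenating.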

Format JSON objects in python

I have JSON objects in a text file (C:\data.txt). There are millions of records; I am using just one record as an example. I want to see only the data in my text file, like:
1 123-567-9876 TEST1 TEST 717-567-9876 Harrisburg null US_PA
I don't want the parentheses, etc.
Once I get the clean data, the plan is to import the data from the text file (say C:\data2.txt) into a SQL database.
This is the format of the JSON object:
{
  "status":"ok",
  "items":[
    {
      "1":{
        "Work_Phone":"123-567-9876",
        "Name_Part":[
          "TEST1",
          "TEST"
        ],
        "Residence_Phone":"717-567-9876",
        "Mailing_City":"Harrisburg",
        "Mailing_Street_Address_line_1":"",
        "Cell_Phone":null,
        "Mailing_Country_AND_Province_OR_State":"US_PA"
      }
    }
  ]
}
Can someone please help with Python code to format this JSON object and export it to a text file?
You can use
import simplejson as json
Then you can open your file and load it into a Python dictionary:
with open("C:/data.txt","r") as f:
    data = json.loads(f.read())
But this only works when the JSON objects are stored in an array in your file. So it has to look like this:
[{ ... first record ...},
{... second record ...},
...,
{... last record ...}]
Then data holds an array of dictionaries. Now you can write the records to another file:
with open("output.txt","w") as g:
    for d in data:
        for i in d["items"]:
            for k in i.keys():
                g.write(... some string built from the parameters ...)
If done well, the file output.txt contains the lines. The details might be difficult because each item seems to contain some arrays.
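One possible way to build that string, sketched with the standard json module and the single sample record from the question: the flatten and to_lines helpers are hypothetical, and rendering None as the text "null" and skipping empty strings are assumptions made to reproduce the desired line.

```python
import json

raw = """
{"status": "ok",
 "items": [{"1": {"Work_Phone": "123-567-9876",
                  "Name_Part": ["TEST1", "TEST"],
                  "Residence_Phone": "717-567-9876",
                  "Mailing_City": "Harrisburg",
                  "Mailing_Street_Address_line_1": "",
                  "Cell_Phone": null,
                  "Mailing_Country_AND_Province_OR_State": "US_PA"}}]}
"""

def flatten(value):
    """Yield the scalar values of a field as strings, expanding nested lists."""
    if isinstance(value, list):
        for element in value:
            yield from flatten(element)
    elif value is None:
        yield "null"
    else:
        yield str(value)

def to_lines(document):
    """Build one space-separated line per record: the record key, then its values."""
    for item in document["items"]:
        for key, record in item.items():
            fields = [key]
            for value in record.values():
                fields.extend(flatten(value))
            # Skip empty strings so they do not leave double spaces.
            yield " ".join(f for f in fields if f)

lines = list(to_lines(json.loads(raw)))
print(lines[0])
```

Skipping empty fields matches the example line, but it shifts column positions from record to record. For a database import it is probably safer to keep every field and join with a tab or comma instead.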
