Given some array parsed from a CSV as follows (don't worry about the parsing part; just consider this array the starting point):
['name,age,city', 'tom,12,new york', 'john, 10, los angeles']
where the first element holds the column names, what's the best way to convert this into a table? I was thinking of using numpy and pandas to create a dataframe, but what would be the most memory/time-efficient way to do this? I am then planning to do some data analysis and create some new features. Is there something in the Python standard library I can use, or is pandas the best way to go? If I were to use just builtin functions, how would I go about it? At the end I would need to combine the features back into the original array form.
Builtins only (aside from pprint for printing):
import pprint

data = [
    "name,age,city",
    "tom,12,new york",
    "john, 10, los angeles",
]

cols = None
out_data = []
for line in data:
    fields = line.split(",")
    # We don't know the columns yet, so the first line must be the header
    if cols is None:
        cols = fields
        continue
    out_data.append(dict(zip(cols, fields)))

pprint.pprint(out_data)
Using the csv standard module:
import csv
import io
import pprint

data = [
    "name,age,city",
    "tom,12,new york",
    "john, 10, los angeles",
]

reader = csv.DictReader(io.StringIO("\n".join(data)))
out_data = list(reader)
pprint.pprint(out_data)
Both approaches output the expected:
[{'age': '12', 'city': 'new york', 'name': 'tom'},
{'age': ' 10', 'city': ' los angeles', 'name': 'john'}]
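If the stray leading spaces and string-typed numbers matter for the later analysis, `csv.DictReader` accepts `skipinitialspace=True`, and values can be converted as they are read: a builtins-only sketch:

```python
import csv
import io

data = [
    "name,age,city",
    "tom,12,new york",
    "john, 10, los angeles",
]

# skipinitialspace=True drops the space that follows each comma
reader = csv.DictReader(io.StringIO("\n".join(data)), skipinitialspace=True)

# Convert age to int while reading; the other fields stay strings
out_data = [{**row, "age": int(row["age"])} for row in reader]
print(out_data)
# [{'name': 'tom', 'age': 12, 'city': 'new york'}, {'name': 'john', 'age': 10, 'city': 'los angeles'}]
```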
Pandas is the way to go. You do not need to parse values yourself: you can just use the read_csv functionality to create a dataframe out of your CSV file, then do feature generation/extraction or data cleaning on that frame. The Python standard library does not offer such capability out of the box.
To gather your values as a Python list at the end of the day, use df.values.tolist().
pandas has C code in its critical sections, which makes it orders of magnitude faster than pure-Python processing on large data.
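Since the data in the question is already an in-memory list of strings rather than a file, one way (a sketch, not the only one) is to join the lines and feed them to read_csv through io.StringIO:

```python
import io
import pandas as pd

data = ['name,age,city', 'tom,12,new york', 'john, 10, los angeles']

# read_csv parses the header row and infers dtypes (age becomes an integer column)
df = pd.read_csv(io.StringIO("\n".join(data)), skipinitialspace=True)

# Example of a derived feature
df["age_next_year"] = df["age"] + 1

# Back to a plain list of rows at the end
rows = df.values.tolist()
```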
I can't speak to efficiency, but as far as an easy way to convert it to a table goes, pandas would be the best option. I would use pandas.read_csv for it.
I'm fairly new to dealing with .txt files that have a dictionary within them. I'm trying to use pd.read_csv to create a dataframe in pandas, but I get the error Error tokenizing data. C error: Expected 4 fields in line 2, saw 11. I believe I found the root of the problem: the file is difficult to read because each row contains a dict whose key-value pairs are separated by commas, which is also the delimiter in this case.
Data (store.txt)
id,name,storeid,report
11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true,"comment":"NAN","itemtrack":"110", "info": {"haircolor":"black", "age":53}, "itemsboughtid":[],"stolenitem":[{"item":"candy","code":1},{"item":"candy","code":1}]}
35,BillyDan,3221-123-555,{"Source":"letter","FileFormat":0,"Isonline":false,"comment":"this is the best store, hands down and i will surely be back...","itemtrack":"110", "info": {"haircolor":"black", "age":21},"itemsboughtid":[1,42,465,5],"stolenitem":[{"item":"shoe","code":2}]}
64,NickWalker,3221-123-555, {"Source":"letter","FileFormat":0,"Isonline":false, "comment":"we need this area to be fixed, so much stuff is everywhere and i do not like this one bit at all, never again...","itemtrack":"110", "info": {"haircolor":"red", "age":22},"itemsboughtid":[1,2],"stolenitem":[{"item":"sweater","code":11},{"item":"mask","code":221},{"item":"jack,jill","code":001}]}
How would I read this csv file and create new columns based on the key-value pairs? In addition, what if some rows have more key-value pairs in the dictionary, for example more than 11 keys? Is there an efficient way to create a df from the example above?
My code when trying to read as csv:
df = pd.read_csv('store.txt', header=None)
I tried to import json and use a converter, but it did not work (I had converted all the commas to a |):
import json
df = pd.read_csv('store.txt', converters={'report': json.loads}, header=0, sep="|")
In addition, I also tried:
import pandas as pd
import json
df=pd.read_csv('store.txt', converters={'report':json.loads}, header=0, quotechar="'")
I also thought about adding a quote at the beginning of the dictionary and one at the end to make it a string, but that seemed too tedious, since I'd have to find the closing brackets.
I think adding quotes around the dictionaries is the right approach. You can use regex to do so and use a different quote character than " (I used § in my example):
from io import StringIO
import re
import json
import pandas as pd

with open("store.txt", "r") as f:
    csv_content = re.sub(r"(\{.*})", r"§\1§", f.read())

df = pd.read_csv(StringIO(csv_content), skipinitialspace=True, quotechar="§", engine="python")

df_out = pd.concat([
    df[["id", "name", "storeid"]],
    pd.DataFrame(df["report"].apply(json.loads).values.tolist())
], axis=1)
print(df_out)
Note: the very last value in your csv isn't valid JSON: "code":001. It should be either "code":"001" or "code":1.
Output:
id name storeid Source ... itemtrack info itemsboughtid stolenitem
0 11 JohnSmith 3221-123-555 online ... 110 {'haircolor': 'black', 'age': 53} [] [{'item': 'candy', 'code': 1}, {'item': 'candy...
1 35 BillyDan 3221-123-555 letter ... 110 {'haircolor': 'black', 'age': 21} [1, 42, 465, 5] [{'item': 'shoe', 'code': 2}]
2 64 NickWalker 3221-123-555 letter ... 110 {'haircolor': 'red', 'age': 22} [1, 2] [{'item': 'sweater', 'code': 11}, {'item': 'ma...
My database has a column where every cell contains a string of data. There are around 15-20 variables; the information is assigned to the variables with an "=" and separated by a space. The number and names of the variables can differ between cells... The issue I face is that the data is separated by spaces, and so are some of the variable values. The variable name appears in every cell, so I can't just make headers and add the values to the data frame like a CSV. The solution also needs to handle all new data in the database automatically.
Example:
Cell 1: TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520"... RELEASED="1880".
Cell 2: TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655"... MAIN CHARACTER="Ishmael".
I want to convert these strings of data into a structured dataframe like this:
TITLE               AUTHOR             PAGES  RELEASED  MAIN CHARACTER
Brothers Karamazov  Fyodor Dostoevsky  520    1880      NaN
Moby Dick           Herman Melville    655    NaN       Ishmael
Any tips on how to move forward? I have thought about converting it into a JSON format by using the replace() function before turning it into a dataframe, but have not yet succeeded. Any tips or ideas are much appreciated.
Thanks!
I guess this sample is what you need.
import pandas as pd

# Helper function
def str_to_dict(cell) -> dict:
    normalized_cell = cell.replace('" ', '\n').replace('"', '').split('\n')
    temp = {}
    for x in normalized_cell:
        key, value = x.split('=')
        temp[key] = value
    return temp

list_of_cell = [
    'TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520" RELEASED="1880"',
    'TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655" MAIN CHARACTER="Ishmael"'
]

dataset = [str_to_dict(i) for i in list_of_cell]
print(dataset)
"""
[{'TITLE': 'Brothers Karamazov', 'AUTHOR': 'Fyodor Dostoevsky', 'PAGES': '520', 'RELEASED': '1880'}, {'TITLE': 'Moby Dick', 'AUTHOR': 'Herman Melville', 'PAGES': '655', 'MAIN CHARACTER': 'Ishmael'}]
"""
df = pd.DataFrame(dataset)
df.head()
"""
TITLE AUTHOR PAGES RELEASED MAIN CHARACTER
0 Brothers Karamazov Fyodor Dostoevsky 520 1880 NaN
1 Moby Dick Herman Melville 655 NaN Ishmael
"""
The Pandas lib can read them from a .csv file and make a data frame - try this:
import pandas as pd
file = 'xx.csv'
data = pd.read_csv(file)
print(data)
Create a Python dictionary from your database rows, then create a Pandas dataframe using pandas.DataFrame.from_dict.
Something like this:
import pandas as pd

# Assumed data from DB, structured like this
data = [
    {
        'TITLE': 'Brothers Karamazov',
        'AUTHOR': 'Fyodor Dostoevsky'
    },
    {
        'TITLE': 'Moby Dick',
        'AUTHOR': 'Herman Melville'
    }
]

# Dataframe as per your requirements
dt = pd.DataFrame.from_dict(data)
Here are some rows of the csv. I have looked over several Stack Exchange questions and other resources but have not been able to come up with a clear solution. I converted my csv into a dictionary using DictReader, but I am not sure how to turn the specific columns into a tuple key. I also need to filter for the year 2020, and I am not sure how to do that without pandas either.
"year","state","state_po","county_name","county_fips","office","candidate","party","candidatevotes","totalvotes","version","mode"
2000,"ALABAMA","AL","AUTAUGA","1001","PRESIDENT","AL GORE","DEMOCRAT",4942,17208,20191203,"TOTAL"
2000,"ALABAMA","AL","AUTAUGA","1001","PRESIDENT","GEORGE W. BUSH","REPUBLICAN",11993,17208,20191203,"TOTAL"
2000,"ALABAMA","AL","AUTAUGA","1001","PRESIDENT","RALPH NADER","GREEN",160,17208,20191203,"TOTAL"
Dictionary comprehension is probably the most Pythonic way.
import csv

with open('csv') as file:
    reader = csv.DictReader(file)
    data = {(i['year'], i['state_po'], i['candidate']): i['candidatevotes'] for i in reader}
After which data is:
{('2000', 'AL', 'AL GORE'): '4942',
('2000', 'AL', 'GEORGE W. BUSH'): '11993',
('2000', 'AL', 'RALPH NADER'): '160'}
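The question also asks about filtering for the year 2020, which the same comprehension can do with an if clause. The sample rows above are all from 2000, so the 2020 row below is made up purely for illustration:

```python
import csv
import io

# Inline sample; in practice you would open the real file.
# The 2020 row is hypothetical, added only to have something to match.
rows = '''"year","state","state_po","county_name","county_fips","office","candidate","party","candidatevotes","totalvotes","version","mode"
2000,"ALABAMA","AL","AUTAUGA","1001","PRESIDENT","AL GORE","DEMOCRAT",4942,17208,20191203,"TOTAL"
2020,"ALABAMA","AL","AUTAUGA","1001","PRESIDENT","SOME CANDIDATE","DEMOCRAT",7503,27770,20210622,"TOTAL"
'''

reader = csv.DictReader(io.StringIO(rows))
data = {
    (i['year'], i['state_po'], i['candidate']): i['candidatevotes']
    for i in reader
    if i['year'] == '2020'  # DictReader yields strings, so compare against '2020'
}
print(data)
# {('2020', 'AL', 'SOME CANDIDATE'): '7503'}
```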
My goal here is to clean up address data from individual CSV files, using a dictionary for each individual column; sort of like automating the find-and-replace feature from Excel. The addresses are divided into columns: house numbers, street names, directions and street types each in their own column. I used the following code to process the whole document:
missad = {
    'Typo goes here': 'Corrected typo goes here'
}

def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

with open('original.csv', 'r') as csvfile:
    text = csvfile.read()

text = replace_all(text, missad)

with open('cleanfile.csv', 'w') as cleancsv:
    cleancsv.write(text)
While the code works, I need to have separate dictionaries, as some columns need specific typo fixes. For example, housenum for the house numbers column, stdir for the street direction, and so on, each with its column-specific typos:
housenum = {
    'One': '1',
    'Two': '2'
}

stdir = {
    'NULL': ''
}
I have no idea how to proceed. I feel it's something simple, or that I would need pandas, but am unsure how to continue; I would appreciate any help! Also, is there any way to group several typos together with one corrected typo? I tried the following but got an unhashable type error:
missad = {
    ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']: 'Corrected typo goes here'
}
Is something like this what you are looking for?
import pandas as pd

df = pd.read_csv(filename, index_col=False)  # using pandas to read in the CSV file

# Say you want to do corrections on the 'columnforcorrection' column:
correctiondict = {
    'one': 1,
    'two': 2
}

df['columnforcorrection'] = df['columnforcorrection'].replace(correctiondict)
You can use the same idea for the other columns of interest.
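Regarding grouping several typos under one correction: a list can't be a dict key because lists are unhashable, but you can write the mapping the other way round and flatten it into the shape that replace() expects (the typo strings here are placeholders):

```python
# One correction maps to several typos; flatten into a plain typo -> fix dict
groups = {
    'Corrected typo': ['Typo 1', 'Typo 2', 'Typo 3'],
    '1': ['One', 'one'],
}

flat = {typo: fix for fix, typos in groups.items() for typo in typos}
print(flat)
# {'Typo 1': 'Corrected typo', 'Typo 2': 'Corrected typo', 'Typo 3': 'Corrected typo', 'One': '1', 'one': '1'}
```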
I want to be able to change CSV data the way we can in JavaScript with JSON: just code and object manipulation, like
var obj = JSON.parse(jsonStr);
obj.name = 'foo bar';
var modifiedJSON = JSON.stringify(obj)
How can I do this for CSV files in Python? Something like:
csvObject = parseCSV(csvStr)
csvObject.age = 10
csvObject.name = csvObject.firstName + csvObject.lastName
csvStr = toCSV(csvObject)
I have a csv file customers.csv. ID, Name, Item, Date are the columns. Example of the csv file:
ID,LastName,FirstName,Item,Date
11231249015,Derik,Smith,Televisionx1,1391212800000
24156246254,Doe,John,FooBar,1438732800000
I know very well that the Python csv library can handle it, but can the file be treated as a whole object and then manipulated? I basically want to combine the first name and last name, and perform some math with the IDs, but in the way JavaScript handles JSON.
Not sure, but maybe you want to use https://github.com/samarjeet27/CSV-Mapper. Install it using pip install csvmapper.
import csvmapper

# create parser instance
parser = csvmapper.CSVParser('customers.csv', hasHeader=True)

# create object
customers = parser.buildDict()  # buildObject() if you want an object

# perform manipulation
for customer in customers:
    customer['Name'] = customer['FirstName'] + ' ' + customer['LastName']
    # remove LastName and FirstName
    # maybe this was what you wanted?
    customer.pop('LastName', None)
    customer.pop('FirstName', None)

print(customers)
Output
[{'Name': 'Smith Derik', 'Item': 'Televisionx1', 'Date': '1391212800000', 'ID': '11231249015'}, {'Name': 'John Doe', 'Item': 'FooBar', 'Date': '1438732800000', 'ID': '24156246254'}]
This combines FirstName and LastName by accessing each row as a dict, and removes the LastName and FirstName keys, replacing them with a single 'Name' property. You can use parser.buildObject() if you want attribute access, as in JavaScript.
Edit
You can save it back to CSV too.
writer = csvmapper.CSVWriter(customers) # modified customers from the above code
writer.write('customers-final.csv')
And regarding being able to perform math, you could use a custom mapper file like:
mapper = csvmapper.DictMapper(x=[
    [
        { 'name': 'ID', 'type': 'long' },
        { 'name': 'LastName' },
        { 'name': 'FirstName' },
        { 'name': 'Item' },
        { 'name': 'Date', 'type': 'int' }
    ]
])
parser = csvmapper.CSVParser('customers.csv', mapper)
and specify the type(s).
JSON can, by design, represent various kinds of data in various kinds of arrangements (objects, arrays...), and you can nest these if you wish. This means it's relatively easy to serialise and deserialise complex objects.
On the other hand, CSV is just rows and columns of data: no structured objects, arrays, nesting, etc. So you basically have to know ahead of time what you're dealing with, and then manually map the rows to corresponding objects.
That said, Python's csv module does have DictReader functionality, which lets you read a CSV file as a sequence of dictionaries, one per row. It automatically maps the first / header row to field names, but you can also pass the field names to the constructor. You can therefore reference a property in a row by its corresponding column header / field name. There is also a corresponding DictWriter class. If you don't need any fancy nesting or complex data structures, these may be all you really need.
This example is directly from the python module documentation:
import csv

with open('names.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
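Putting DictReader and DictWriter together gives roughly the JSON.parse / JSON.stringify round trip from the question, using only the standard library (the customers.csv layout from the question is assumed, inlined here as a string):

```python
import csv
import io

# Inline stand-in for customers.csv from the question
csv_str = """\
ID,LastName,FirstName,Item,Date
11231249015,Derik,Smith,Televisionx1,1391212800000
24156246254,Doe,John,FooBar,1438732800000
"""

# "parse": each row becomes a plain dict you can manipulate like an object
rows = list(csv.DictReader(io.StringIO(csv_str)))
for row in rows:
    row["Name"] = row["FirstName"] + " " + row["LastName"]
    del row["FirstName"], row["LastName"]

# "stringify": write the modified rows back out as CSV text
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["ID", "Name", "Item", "Date"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
# ID,Name,Item,Date
# 11231249015,Smith Derik,Televisionx1,1391212800000
# 24156246254,John Doe,FooBar,1438732800000
```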