I have a DataFrame variable called "obsData", which has the structure:
I then use this variable as input to some code (written with much help from Stack Overflow) that sorts all the hourly data into one row for each day:
f = obsData
data = {}
for line in f:
    if 'Date' not in line or 'Temp' not in line:
        k, v, = line.split()  # split line in 2 parts, k and v
        temperature = v.split(';')[1]
        if k not in data:
            data[k] = [temperature]
        else:
            data[k].append(temperature)
for k, v in data.items():
    outPut = "{} ;{}".format(k, ";".join(v))
My issue is that the variable "line" never manages to get past the first row of the data in "obsData". It only manages to read 'Date', never the second column 'Temp'. As a consequence the split function tries to split 'Date', but since it is only one value I get the error:
ValueError: not enough values to unpack (expected 2, got 1)
I have tried to redefine "f" (i.e. "obsData") from a DataFrame into an ndarray or string, to make it easier for the code to work with the data:
f = f.values  # into ndarray
f = f.astype(str)  # into string, try 1
f[['Date', 'Temp']] = f[['Date', 'Temp']].astype(str)  # into string, try 2
But for some reason I don't understand, I can't convert it. What am I doing wrong? Any help is much appreciated!
EDIT for clarification: I get the error at the line with
k, v, = line.split()
When importing csv data it's best to use pandas:
import pandas as pd
df = pd.read_csv('obsData.csv')
If you still need to loop, check itertuples.
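For example, a minimal sketch of the day-grouping with groupby, assuming obsData has string columns 'Date' and 'Temp' where 'Date' carries a date plus an hour (the sample frame below is hypothetical). Note that iterating a DataFrame directly, as in for line in f:, only yields the column names, which is why your loop never gets past 'Date':

import pandas as pd

# hypothetical stand-in for obsData (column names taken from the question)
obsData = pd.DataFrame({'Date': ['2018-01-01 00:00', '2018-01-01 01:00', '2018-01-02 00:00'],
                        'Temp': [1.2, 1.5, 0.8]})

day = obsData['Date'].str.split().str[0]  # the date part, without the hour
for k, temps in obsData.groupby(day)['Temp']:
    outPut = "{} ;{}".format(k, ";".join(temps.astype(str)))
    print(outPut)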
Good evening,
May I please get advice on a section of code that I have? Here is the code:
import pandas as pd
import numpy as np
import os
import logging
#logging.getLogger('tensorflow').setLevel(logging.ERROR)

os.chdir('/media/Maps/test_run/')

coasts_data = 'new_coastline_combined.csv'
coasts = pd.read_csv(coasts_data, header=None, encoding='latin1')

# Combine all data into a single frame
frames = [coasts]
df = pd.concat(frames)

n = 1000  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

l = []
for index, frame in enumerate(list_df):
    vals = frame.iloc[:, 2].values
    # if any values in this part of the frame are wrong, store index for deletion
    if np.any(vals < -100000):
        l.append(index)

for x in sorted(l, reverse=True):
    del list_df[x]

df1 = pd.concat(list_df)
df1.to_csv(r'/media/test_run_6/test.csv', index=False, header=False)
What the code does is take data from a csv file, break it into chunks of 1000 rows each, determine whether erroneous values are present, and delete any chunk that contains them. It works fine on most csv files. However, on one csv file I'm getting the following error:
TypeError: '<' not supported between instances of 'str' and 'int'
The error begins at this section of code:
if np.any(vals < -100000):
I suspect (and please correct me if I'm wrong) that there are null (empty) cells in this particular column of values inside the csv (the csv is 6,000,000 rows deep, btw).
May I please get help finding out what the problem is and how to fix it? Thank you.
I recommend catching the error to find out which values are causing it, and then manually checking what problem has occurred in your data.
Something like this:
try:
    if np.any(vals < -100000):
        l.append(index)
except TypeError:
    print(vals)
    print(index)
If it's really just empty cells, you could check for that and ignore those cells.
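If that is the case, here is a sketch of one way to check (assuming, as in the question, that the third column is the one that should be numeric): pd.to_numeric with errors='coerce' turns anything unparseable into NaN, which both exposes the bad rows and makes the comparison safe, since NaN < -100000 is simply False:

import pandas as pd
import numpy as np

vals = pd.to_numeric(frame.iloc[:, 2], errors='coerce')  # bad cells become NaN
print(frame[vals.isna()])  # inspect the offending rows
if np.any(vals.values < -100000):  # NaN compares as False, so empty cells are ignored
    l.append(index)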
I am trying to parse an Excel file, with the output as a list of dictionaries. Of the two columns, Date (any date format) and Account (see pic), I wish to change Account from float to integer (in the Excel it has no decimal values).
How do I make this happen so that it is saved permanently, for further code that references this list of dictionaries?
The output of my code is as seen in the picture here.
I tried various options but was unsuccessful in making the change and having it displayed in the output.
My code:
import xlrd

workbook = xlrd.open_workbook('filelocation')
ws = workbook.sheet_by_index(0)

first_row = []  # the first row, holding the column names
for col in range(ws.ncols):
    first_row.append(ws.cell_value(0, col))

# creating a list of dictionaries
data = []
for row in range(1, ws.nrows):
    d = {}
    for col in range(ws.ncols):
        d[first_row[col]] = ws.cell_value(row, col)
    data.append(d)

for i in data:
    if i['Account'] in data:
        i['Account'] = int(i['Account'])
        print(int(i['Account']))

print(data)
I added the last part to make the change to the Account column, but it does not save the changes in the output.
Your problem is with the condition if i['Account'] in data:.
data is a list of dicts, while i['Account'] is a float. So the condition above is never met, and no value ever gets converted to int.
From what I understand of your code, you can simply remove the condition:
for i in data:
    i['Account'] = int(i['Account'])
If you want to generally change all floats to ints, you could change the part where you read the file to:
for col in range(ws.ncols):
    value = ws.cell_value(row, col)
    try:
        value = int(value)
    except (ValueError, TypeError):
        pass  # keep the original value if it cannot be converted
    d[first_row[col]] = value
You have the right idea, but the syntax is a little off.
for element in data:  # iterate through the list
    for key, value in element.items():  # iterate through each dict
        if isinstance(value, float):
            element[key] = int(value)
This will cast all your floats to ints.
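For example, on a hypothetical row shaped like the question's data:

data = [{'Date': '04-01-2021', 'Account': 52000.0}]
for element in data:
    for key, value in element.items():
        if isinstance(value, float):
            element[key] = int(value)
print(data)  # [{'Date': '04-01-2021', 'Account': 52000}]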
So I'm having a problem printing the max and min values from a file. The file has over 3000 lines and looks like this:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
...
...
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
This is my current code. I have no idea how to make it work: right now it only prints the 3968 value, because that is obviously the largest string, but I want the largest and smallest values from the second column (all the stock prices).
def apple():
    stock_file = open('apple_USD.txt', 'r')
    data = stock_file.readlines()
    data = data[0:]
    stock_file.close()
    print(max(data))
Your current code outputs the "correct" result only by chance, since it is using string comparison.
Consider this:
with open('test.txt') as f:
    lines = [line.split(', ') for line in f.readlines()[1:]]

# lines is a list of lists; each sub-list represents a line in the format [date, value]
max_value_date, max_value = max(lines, key=lambda line: float(line[-1].strip()))
print(max_value_date, max_value)
# '2015-10-09' '112.120003'
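The minimum works the same way, with min() and the same key:

min_value_date, min_value = min(lines, key=lambda line: float(line[-1].strip()))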
Your current code is reading each line as a string and then finding the max and min lines in your list. You can use pandas to read the file as CSV, load it as a data frame, and then do your min/max operations on the data frame.
Hope the following answers your question:
stocks = []
data = data[1:]
for d in data:
    stocks.append(float(d.split(',')[1]))
print(max(stocks))
print(min(stocks))
I recommend the Pandas module for working with tabular data; use its read_csv function. It is very well documented, optimized, and very popular for this purpose. You can install it with pip using pip install pandas.
I created a dummy file with your format and stored it in a file called test.csv:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
Then, to parse the file you can do as follows. The names parameter defines the names of the columns; skiprows lets you skip the first line.
#import module
import pandas as pd
#load file
df = pd.read_csv('test.csv', names=['date', 'value'], skiprows=[0])
#get max and min values
max_value = df['value'].max()
min_value = df['value'].min()
You want to extract the second column into a float using float(datum.split(', ')[1].strip()), and ignore the first line.
def apple():
    stock_file = open('apple_USD.txt', 'r')
    data = stock_file.readlines()
    data = data[1:]  # ignore first line
    stock_file.close()
    data = [datum.split(', ') for datum in data]
    max_value_date, max_value = max(data, key=lambda datum: float(datum[-1].strip()))
    print(max_value_date, max_value)
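And symmetrically for the minimum:

min_value_date, min_value = min(data, key=lambda datum: float(datum[-1].strip()))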
Or you can do it in a simpler way: make a list of the prices and then take the maximum and minimum, like this:
# the first line in your txt is not data
datanew = data[1:]
prices = []
for line in datanew:
    line_after = line.split(',')
    price = line_after[1]
    prices.append(float(price))
maxprice = max(prices)
minprice = min(prices)
I have a csv file which contains four columns and many rows, each representing different data, e.g.
OID DID HODIS BEAR
1 34 67 98
I have already opened and read the csv file; however, I am unsure how I can make each column into a key. I believe the format I have used in the code below is best for the task I am working on.
Please see my code below; sorry if the explanation is a bit confusing.
Note that the #Values in column 1 part is what I am stuck on; I am unsure how I can define each column.
for line in file_2:
    the_dict = {}
    OID = line.strip().split(',')
    DID = line.strip().split(',')
    HODIS = line.strip().split(',')
    BEAR = line.strip().split(',')
    the_dict['KeyOID'] = OID
    the_dict['KeyDID'] = DID
    the_dict['KeyHODIS'] = HODIS
    the_dict['KeyBEAR'] = BEAR
    dictionary_list.append(the_dict)
print(dictionary_list)
There is a great Python function for strings that will split them on a delimiter: .split(delim), where delim is the delimiter; it returns the pieces as a list.
From the code in your screenshot, you can use the following to split on a ',', which I assume is your delimiter because you said your file is a CSV.
...
for line in file_contents_2:
    the_dict = {}
    values = line.strip().split(',')
    OID = values[0]
    DID = values[1]
    HODIS = values[2]
    BEAR = values[3]
...
Also, in case you ever need to split a string based on whitespace, that is the default argument for .split() (the default argument is used when no argument is provided).
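For example:

line = '1,34,67,98\n'
print(line.strip().split(','))  # ['1', '34', '67', '98']
print('OID DID HODIS BEAR'.split())  # no argument: splits on whitespace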
I would write the whole thing like this:
lod = []
with open(file, 'r') as f:
    l = f.readlines()
for i in l[1:]:
    lod.append(dict(zip(l[0].rstrip().split(), i.split())))
split doesn't need a parameter here; just use a simple for loop with with open, and there is no need to know the keys in advance.
And if you care about empty dictionaries, do:
lod = list(filter(None, lod))
print(lod)
Output:
[{'OID': '1', 'DID': '34', 'HODIS': '67', 'BEAR': '98'}]
If you want integers:
lod = [{k: int(v) for k, v in i.items()} for i in lod]
print(lod)
Output:
[{'OID': 1, 'DID': 34, 'HODIS': 67, 'BEAR': 98}]
Another way to do it is using a library like Pandas, which is powerful for working with tabular data. It is fast, as we avoid explicit loops. In the example below you only need Pandas and the name of the CSV file; I used io just to turn string data into a file-like object that mimics a csv.
import pandas as pd
from io import StringIO
data = StringIO('''
OID,DID,HODIS,BEAR
1,34,67,98''')  # mimic a csv file

df = pd.read_csv(data, sep=',')
print(df.T.to_dict()[0])
In the end you need only a one-liner that chains the commands: read the csv, transpose, and transform to a dictionary:
import pandas as pd
csv_dict = pd.read_csv('mycsv.csv',sep=',').T.to_dict()[0]
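Note that .T.to_dict()[0] returns only the dictionary for the first row. If you instead want the whole file as a list of dictionaries, one per row, the 'records' orientation does that directly:

import pandas as pd

csv_dicts = pd.read_csv('mycsv.csv', sep=',').to_dict('records')
# [{'OID': 1, 'DID': 34, 'HODIS': 67, 'BEAR': 98}, ...]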
I'm trying to create code that checks whether the value in the index column of a CSV is the same across different rows, and if so, finds the most frequently occurring values in the other columns and uses those as the final data. Not a very good explanation; basically I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes there are multiple rows for the same customer, finds the value that occurs most often in each column, and outputs those as one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn: if there are values with the same number of occurrences (month and B for customer 1004), how can I choose which one gets output?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd

df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print(df)
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create the following (I was ignoring month previously, but now I'd like to incorporate it, so that I can learn not only how to find the mode of a column of numbers but also the most frequently occurring string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it is outputting the .0 at the end of the ABC; it seems to be the wrong variable format. I want each column to be output as just the 3-digit number.
Edit: I'm also having an issue that if the value in column A is 0, the output becomes 2 digits and drops the leading 0.
What about something like this? It is not using Pandas, though; I am not a Pandas expert.
from collections import Counter

dataDict = {}

# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    for line in dataFile:
        # split the line by ',' since it is a csv file (strip the newline first)
        entry = line.strip().split(',')
        # Check to make sure that there is data in the line
        if entry and len(entry[0]) > 0:
            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month': [entry[1]],
                                      'time': [entry[2]],
                                      'ABC': [''.join(entry[3:])],
                                      }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))

# Now write the output file
with open('out.csv', 'w') as f:
    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):
        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC, '\n']))
It generates a file called out.csv that looks like this (the input's header row is treated as just another customer, so it sorts after the numeric IDs):
1003,Jan,2:00,114,
1004,Feb,8:00,251,
customer_ID,month,time,ABC,
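For completeness, here is a pandas sketch of the same aggregation (not part of the answer above): take each column's mode per customer. Series.mode() returns ties in sorted order, so .iloc[0] breaks ties by picking the first value in that order ('Feb' and 4 for customer 1004); pick a different element of the mode Series if you want a different tie-break, which also answers the question about choosing among equally frequent values:

import pandas as pd

df = pd.read_csv('data.csv')
# one mode per column per customer; .iloc[0] resolves ties deterministically
agg = df.groupby('customer_ID').agg(lambda s: s.mode().iloc[0])
# concatenating as strings keeps any leading 0 (e.g. when A is 0)
agg['ABC'] = agg['A'].astype(int).astype(str) + agg['B'].astype(int).astype(str) + agg['C'].astype(int).astype(str)
agg.reset_index()[['customer_ID', 'month', 'ABC']].to_csv('answer.csv', index=False)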