extract information from excel into python 2d array - python

I have an Excel sheet with dates, times, and temperatures that looks like this:
Using Python, I want to extract this info into Python arrays.
Each array would get the date in position 0, and then store the temps in the following positions, looking like this:
temparray[0] = [20130102,34.75,34.66,34.6,34.6,....,34.86]
temparray[1] = [20130103,34.65,34.65,34.73,34.81,....,34.64]
Here is my attempt, but it sucks:
from xlrd import open_workbook

wb = open_workbook('temp.xlsx')
for s in wb.sheets():
    for row in range(s.nrows):
        values = []
        for col in range(s.ncols):
            values.append(s.cell(row, col).value)
        print(values[0])
        print("%.2f" % values[1])
        print()
I used xlrd, but I am open to using anything. Thank you for your help.

From what I understand of your question, the problem is that you want the output to be a list of lists, and you're not getting such a thing.
And that's because there's nothing in your code that even tries to get such a thing. For each row, you build a list, print out the first value of that list, print out the second value of that list, and then forget the list.
To append each of those row lists to a big list of lists, all you have to do is exactly the same thing you're doing to append each column value to the row lists:
temparray = []
for row in range(s.nrows):
    values = []
    for col in range(s.ncols):
        values.append(s.cell(row, col).value)
    temparray.append(values)
From your comment, it looks like what you actually want is not only this, but also to group the temperatures by day, and to take only the second column rather than every value in each row. That is not at all what you described in the question. In that case, you shouldn't be looping over the columns at all. What you want is something like this:
days = []
current_day, current_date = [], None
for row in range(s.nrows):
    date = s.cell(row, 0).value
    if date != current_date:
        current_day, current_date = [], date
        days.append(current_day)
    current_day.append(s.cell(row, 2).value)
This code assumes that the dates are always in sorted order, as they are in your input screenshot.
I would probably structure this differently, building a row iterator to pass to itertools.groupby, but I wanted to keep this as novice-friendly, and as close to your original code, as possible.
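For reference, a minimal sketch of that groupby version, assuming the same sheet object s as above:

from itertools import groupby

def rows(sheet):
    # Yield (date, temp) pairs, one per spreadsheet row.
    for row in range(sheet.nrows):
        yield sheet.cell(row, 0).value, sheet.cell(row, 2).value

# groupby collapses consecutive rows that share a date (column 0),
# which relies on the same sorted-dates assumption as the loop above.
days = [[temp for _, temp in group]
        for _, group in groupby(rows(s), key=lambda r: r[0])]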
Also, I suspect you really don't want this:
[[date1, temp1a, temp1b, temp1c],
[date2, temp2a, temp2b]]
… but rather something like this:
{date1: [temp1a, temp1b, temp1c],
 date2: [temp2a, temp2b]}
But without knowing what you're intending to do with this info, I can't tell you how best to store it.

If you are looking to keep all the data for the same dates, I might suggest using a dictionary to get a list of the temps for particular dates. Then once you get the dict initialized with your data, you can rearrange how you like. Try something like this after wb=open_workbook('temp.xlsx'):
tmpDict = {}
for s in wb.sheets():
    for row in range(s.nrows):
        try:
            tmpDict[s.cell(row, 0).value].append(s.cell(row, 2).value)
        except KeyError:
            tmpDict[s.cell(row, 0).value] = [s.cell(row, 2).value]
If you print tmpDict, you should get an output like:
{date1: [temp1, temp2, temp3, ...],
date2: [temp1, temp2, temp3, ...]
...}
Dictionary keys are kept in arbitrary order in older versions of Python (it has to do with the hash value of the key; modern Python preserves insertion order), but you can construct a list of lists sorted by date from the content of the dict like so:
tmpList = []
for key in sorted(tmpDict):
    valList = [key]
    valList.extend(tmpDict[key])
    tmpList.append(valList)
Then you'll get a list of lists ordered by date with the values, as you were originally aiming for. However, you can always get at the values in the dictionary by using the keys. I typically find it easier to work with the data that way afterwards, but you can change it to any form you need.
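As an aside, the try/except pattern above can be replaced with collections.defaultdict, which creates the missing list on first access; a minimal sketch, assuming the same workbook:

from collections import defaultdict

tmpDict = defaultdict(list)
for s in wb.sheets():
    for row in range(s.nrows):
        # Missing keys are created as empty lists automatically.
        tmpDict[s.cell(row, 0).value].append(s.cell(row, 2).value)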

Related

Store list of dictionaries as a DataFrame

Suppose I have a list of dictionaries:
l = [{'car': 'good'},
     {'mileage': 'high'},
     {'interior': 'stylish'},
     {'car': 'bad'},
     {'engine': 'powerful'},
     {'safety': 'low'}]
Basically these are noun-adjective pairs.
How can I visualize which adjectives are most associated with, let's say, car here?
How do I convert this to a DataFrame? I have tried pd.DataFrame(l), but here the key is not the column name, so it gets a little tricky.
Any help would be appreciated.
Given that you want this to be done column-wise, you have to restructure your list of dictionaries. You need one dictionary to represent one row. Therefore, your example list should be (I added a second row for clarity; note that a dict literal cannot hold the key 'car' twice, so each noun appears once per row):
l = [
    {'car': 'good', 'mileage': 'high', 'interior': 'stylish', 'engine': 'powerful', 'safety': 'low'},  # row 1
    {'car': 'bad', 'mileage': 'low', 'interior': 'old', 'engine': 'powerful', 'safety': 'low'}         # row 2
]
At this point, all you have to do is call pd.DataFrame(l).
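With that structure, each dictionary maps column names to cell values, so the constructor lines up the keys for you; roughly:

import pandas as pd

df = pd.DataFrame(l)
print(df)
#     car mileage interior    engine safety
# 0  good    high  stylish  powerful    low
# 1   bad     low      old  powerful    low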
EDIT: Based on your comments, I think you need to convert the dictionary to a list to get your desired result. Here is a quick way (I'm sure it can be much more efficient):
l = [{'car': 'good'},
     {'mileage': 'high'},
     {'interior': 'stylish'},
     {'car': 'bad'},
     {'engine': 'powerful'},
     {'safety': 'low'}]

new_list = []
for item in l:
    for key, value in item.items():
        temp = [key, value]
        new_list.append(temp)

df = pd.DataFrame(new_list, columns=['Noun', 'Adjective'])
You can also construct your DataFrame from a list of tuples. To get tuples from a dict, use its items() method; build the list with a comprehension that takes the first (and only) pair from each dict:
import pandas as pd

df = pd.DataFrame(data=[list(d.items())[0] for d in l], columns=['A', 'B'])
print(df)
Gives:
A B
0 car good
1 mileage high
2 interior stylish
3 car bad
4 engine powerful
5 safety low
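For the original question about which adjectives are most associated with a noun, one rough follow-up, assuming the Noun/Adjective frame built above:

# Count how often each adjective appears alongside the noun 'car'.
car_adjectives = df[df['Noun'] == 'car']['Adjective'].value_counts()
print(car_adjectives)
# good    1
# bad     1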

how to append a dataframe to an existing dataframe inside a loop

I made a simple DataFrame named middle_dataframe in Python, which looks like this and has only one row of data:
[display of the existing dataframe]
And I want to append a new dataframe generated each time in a loop to this existing dataframe. This is my program:
k = 2
for k in range(2, 32021):
    header = whole_seq_data[k]
    if header.startswith('>'):
        id_name = get_ucsc_ids(header)
        (chromosome, start_p, end_p) = get_chr_coordinates_from_string(header)
        if whole_seq_data[k + 1].startswith('[ATGC]'):
            seq = whole_seq_data[k + 1]
            df_temp = pd.DataFrame(
                {
                    "ucsc_id": [id_name],
                    "chromosome": [chromosome],
                    "start_position": [start_p],
                    "end_position": [end_p],
                    "whole_sequence": [seq]
                }
            )
            middle_dataframe.append(df_temp)
            k = k + 2
My iterations in the for loop seem to be fine, and I checked that the variables store the correct values after applying the regular expressions. But middle_dataframe never changes, and I cannot figure out why.
The DataFrame.append method returns the result of the append, rather than appending in-place (link to the official docs on append). The fix should be to replace this line:
middle_dataframe.append(df_temp)
with this:
middle_dataframe = middle_dataframe.append(df_temp)
Depending on how that works with your data, you might also need to pass in the parameter ignore_index=True.
The docs warn that appending one row at a time to a DataFrame can be more computationally intensive than building a python list and converting it into a DataFrame all at once. That's something to look into if your current approach ends up too slow for your purposes.
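A hedged sketch of that list-based pattern (worth knowing in any case, since DataFrame.append was deprecated and later removed in pandas 2.0 in favor of pd.concat):

import pandas as pd

rows = []  # accumulate plain dicts instead of appending DataFrames
for i in range(3):  # stand-in for the real parsing loop over whole_seq_data
    rows.append({"ucsc_id": "id_%d" % i, "start_position": i * 10})

# Build the DataFrame once at the end; far cheaper than growing it row by row.
middle_dataframe = pd.DataFrame(rows)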

Output strings to specific rows in excel with python

I have a list MyList of integers, which I convert to binary strings and output to an Excel table.
The code looks like:
r = 2
for x in MyList:
    binary_out = bin(x)[2:].zfill(9)
    for ind, val in enumerate(str(binary_out)):
        worksheet.write(r, ind + 2, val)
    r += 1
and the output of the binary strings looks like:
(Note: this is the binary output of the code above; the other data are generated in an earlier phase.)
This is so far OK.
I would like to get the output only on specific rows, not on all of them.
The information about which rows to write to is available only as indices, which I collected earlier into a list:
indices = [2,4,6,7]
As you can see above, the output in excel starts from row 2.
So row 2 should now hold the number at the first index of indices.
So the output should look like this:
How do I modify the code to get the output on the wanted rows?
Maybe you can use zip to pair each value with its target row:
for x, row_idx in zip(MyList, indices):
    binary_out = bin(x)[2:].zfill(9)
    for ind, val in enumerate(str(binary_out)):
        worksheet.write(row_idx, ind + 2, val)
I do not know exactly what library you are using so I don't know what ind and val are.
You can also pick out only the wanted values dynamically with a generator expression:
# for x in MyList:
for x in (MyList[idx] for idx in indices):
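Putting it together, a minimal end-to-end sketch; this assumes xlsxwriter, since worksheet.write(row, col, value) matches its API, and the sample data is made up:

import xlsxwriter

MyList = [5, 130, 280, 73]
indices = [2, 4, 6, 7]

workbook = xlsxwriter.Workbook('binary_out.xlsx')
worksheet = workbook.add_worksheet()

# Write the i-th number's bits on the row given by indices[i].
for x, row_idx in zip(MyList, indices):
    for col_offset, bit in enumerate(bin(x)[2:].zfill(9)):
        worksheet.write(row_idx, col_offset + 2, bit)

workbook.close()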

How to define a variable amount of columns in python pandas apply

I am trying to add columns to a pandas DataFrame using the apply function.
However, the number of columns to be added depends on the output of the function
used inside apply.
example code:
number_of_columns_to_be_added = 2

def add_columns(number_of_columns_to_be_added):
    df['n1'], df['n2'] = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
Any idea on how to define the ugly column part (df['n1'], ..., df['n696969']) before the = zip( ... part programmatically?
The output of zip is a sequence of tuples, one per new column, so you could try this:
temp = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    key = 'n' + str(i)
    df[key] = value
temp will hold all the entries, and you then iterate over temp to assign the values to your DataFrame under the generated column names. Hope this matches your original idea.
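A self-contained sketch of that loop, using a hypothetical do_something that just returns n derived values per input (the name and behavior are stand-ins for whatever the original function does):

import pandas as pd

def do_something(x, n):
    # Hypothetical stand-in: produce n derived values from one input.
    return tuple(x * (i + 1) for i in range(n))

df = pd.DataFrame({'input': [1, 2, 3]})
number_of_columns_to_be_added = 2

temp = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    df['n' + str(i)] = value

print(df)
#    input  n1  n2
# 0      1   1   2
# 1      2   2   4
# 2      3   3   6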

Drop Pandas DataFrame lines according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('Group').apply(my_agg).reset_index(), so now I have DataFrames with information about groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to derive my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name + "_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to work with Sample[1] in the for doesn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the Bad_Row column to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
What should I do?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame along with the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the Values in your input data meet the specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
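As a side note, the row-wise apply can be replaced with a vectorized groupby aggregation; a rough sketch, assuming the same SampleList and my_max as above:

for sample_df, group_df in SampleList:
    # True for each Group whose maximum Value exceeds the threshold.
    bad = sample_df.groupby('Group')['Value'].max() > my_max
    # Drop, in place, the aggregated rows that belong to flagged groups.
    bad_groups = bad[bad].index
    group_df.drop(group_df[group_df['Group'].isin(bad_groups)].index, inplace=True)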
