pandas speed up creation of columns from column of lists - python

I have an original dataset with information stored as a list of dicts in a column (this is a MongoDB extract). This is the column:
[{u'domain_id': ObjectId('A'), u'p': 1},
 {u'domain_id': ObjectId('B'), u'p': 2},
 {u'domain_id': ObjectId('B'), u'p': 3},
 ...
 {u'domain_id': ObjectId('CG'), u'p': 101}]
I'm only interested in the first 10 dicts ('p' values from 1 to 10). The output DataFrame should look like this:
index |  A  | ... |  B
----------------------
  0   |  1  | ... |  2
  1   | NaN | ... | NaN
  2   | NaN | ... |  3
That is: for each line of my original DataFrame, I create a column for each domain_id and associate it with the corresponding 'p' value. The same domain_id can appear for several 'p' values; in that case I only keep the first one (smallest 'p').
Here is my current code, which may be easier to understand:
import numpy as np
import pandas as pd

first = True
for i in df.index[:]:                      # for each line of the original DataFrame
    temp_list = df["positions"][i]         # the column with the list of dicts inside
    col_list = []
    data_list = []
    for j in range(10):                    # get the first 10 values
        try:
            if temp_list[j]["domain_id"] not in col_list:  # check if domain_id already exists
                col_list.append(temp_list[j]["domain_id"])
                data_list.append(temp_list[j]["p"])
        except IndexError as e:
            print(e)
    # create a temporary DataFrame for this line of the original DataFrame
    df_temp = pd.DataFrame([np.transpose(data_list)], columns=col_list)
    if first:
        df_kw = df_temp
        first = False
    else:
        # concat all the temporary DataFrames: the output has the same
        # number of lines as the original DataFrame
        df_kw = pd.concat([df_kw, df_temp], axis=0, ignore_index=True)
This is all working fine, but it is very, very slow, as I have 15k lines and end up with 10k columns.
I'm sure (or at least I hope very much) that there is a simpler and faster solution: any advice will be much appreciated.

I found a decent solution: the slow part is the concatenation, so it is much more efficient to create the DataFrame first and then update the values.
Create the DataFrame:
col_list = []
for i in df.index[:]:
    temp_list = df["positions"][i]
    for j in range(10):
        try:
            # no need to check for duplicates here, set(col_list) below handles it
            col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:
            print(e)
df_total = pd.DataFrame(index=df.index, columns=set(col_list))
Update the values:
for i in df.index[:]:
    temp_list = df["positions"][i]
    col_list = []
    for j in range(10):
        try:
            if temp_list[j]["domain_id"] not in col_list:  # avoid overwriting values
                df_total.loc[i, temp_list[j]["domain_id"]] = temp_list[j]["p"]
                col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:
            print(e)
Creating a 15k x 6k DataFrame took about 6 seconds on my computer, and filling it took 27 seconds.
I killed the former solution after more than 1 hour running, so this is really faster.
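A further simplification is possible: instead of pre-allocating the frame and filling it cell by cell, build one plain dict per row and let the DataFrame constructor align the columns in a single call. This is only a sketch (not benchmarked on the 15k-row data) and assumes df["positions"] holds the lists of dicts shown above:

import pandas as pd

def first_ten_positions(positions):
    # Keep only the first 10 entries; for duplicate domain_ids, setdefault
    # keeps the first (smallest 'p') value seen.
    out = {}
    for entry in positions[:10]:
        out.setdefault(entry["domain_id"], entry["p"])
    return out

# One dict per row; the constructor aligns keys into columns and fills NaN
# where a row has no value for a given domain_id.
df_kw = pd.DataFrame([first_ten_positions(lst) for lst in df["positions"]],
                     index=df.index)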

Related

Python - looping through rows and concating rows until a certain value is encountered

I am getting myself very confused over a problem I am encountering with a short python script I am trying to put together. I am trying to iterate through a dataframe, appending rows to a new dataframe, until a certain value is encountered.
import pandas as pd

# This function will take a raw AGS file (saved as a CSV) and convert it to a
# dataframe.
# It will take the AGS CSV and print the top 5 header lines.
def AGS_raw(file_loc):
    raw_df = pd.read_csv(file_loc)
    # print(raw_df.head())
    return raw_df

import_df = AGS_raw('test.csv')

def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if "**PROJ" == True:
            cut_df = cut_df.concat([cut_df, df_new_row], ignore_index=True, sort=False)
        elif "**ABBR" == True:
            break
        print(raw_df)
        return cut_df
I don't need to get into specifics, but the values (**PROJ and **ABBR) in this data occur as single cells at the top of tables. So I want to loop row-wise through the data, appending rows until **ABBR is encountered.
When I call AGS_snip(import_df), nothing happens. Previous incarnations just spat out the whole df, and I'm just confused over the logic of the loops. Any assistance much appreciated.
EDIT: raw text of the CSV
**PROJ,
1,32
1,76
32,56
,
**ABBR,
1,32
1,76
32,56
The reason that "nothing happens" is likely because of the conditions you're using in the if and elif.
Neither "**PROJ" == True nor "**ABBR" == True will ever be True, because neither "**PROJ" nor "**ABBR" is equal to True. Your code is equivalent to:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if False:
            cut_df = cut_df.concat([cut_df, df_new_row], ignore_index=True, sort=False)
        elif False:
            break
        print(raw_df)
        return cut_df
Which is the same as:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
You also always return from inside the loop and df_new_row isn't used for anything, so it's equivalent to:
def AGS_snip(raw_df):
    first_row = next(raw_df.iterrows(), None)
    if first_row:
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
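For reference, the comparison was presumably meant to be against the cell's value rather than a bare string literal. A minimal sketch of that intent (hypothetical, assuming the marker strings sit in the first column) might look like:

def AGS_snip(raw_df):
    # Collect rows from the '**PROJ' marker up to (not including) '**ABBR'.
    rows = []
    collecting = False
    for _, row in raw_df.iterrows():
        cell = str(row.iloc[0])
        if cell.startswith("**ABBR"):
            break
        if cell.startswith("**PROJ"):
            collecting = True
        if collecting:
            rows.append(row)
    return pd.DataFrame(rows)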
Here's how to parse your CSV file into multiple separate dataframes based on a row condition. Each dataframe is stored in a Python dictionary, with titles as keys and dataframes as values.
import pandas as pd

df = pd.read_csv('ags.csv', header=None)

# Drop rows which consist of all NaN (Not a Number) / missing values,
# then reset the index so it runs from 0 to the end of the dataframe.
df = df.dropna(axis='rows', how='all').reset_index(drop=True)

# Grab indices of rows beginning with "**", and append an "end" index.
idx = df.index[df[0].str.startswith('**')].append(pd.Index([len(df)]))

# Dictionary of { dataframe title : dataframe }.
dfs = {}
for k in range(len(idx) - 1):
    table_name = df.iloc[idx[k], 0]
    dfs[table_name] = df.iloc[idx[k]+1:idx[k+1]].reset_index(drop=True)

# Print the titles and tables.
for k, v in dfs.items():
    print(k)
    print(v)
# **PROJ
#     0     1
# 0   1  32.0
# 1   1  76.0
# 2  32  56.0
# **ABBR
#     0     1
# 0   1  32.0
# 1   1  76.0
# 2  32  56.0

# Access each dataframe by indexing the dictionary "dfs", for example:
print(dfs['**ABBR'])
#     0     1
# 0   1  32.0
# 1   1  76.0
# 2  32  56.0

# You can rename column names, for example:
dfs['**PROJ'] = dfs['**PROJ'].set_axis(['data1', 'data2'], axis='columns')
print(dfs['**PROJ'])
#    data1  data2
# 0      1   32.0
# 1      1   76.0
# 2     32   56.0

How to append a new value in a CSV file in Python?

I have a CSV sheet, having data like this:
| not used | Day 1 | Day 2 |
| Person 1 | Score | Score |
| Person 2 | Score | Score |
But with a lot more rows and columns. Every day I get each person's progress as a dictionary where the keys are names and the values are score amounts.
The thing is, sometimes that dictionary will include new people and not include already existing ones. If a new person appears, they should get 0 for every previous day; and if the dict doesn't include an already existing person, that person should get a score of 0 for that day.
My idea of solving this is doing lines = file.readlines() on that CSV file, making a new list of people's names with
for line in lines:
    names.append(line.split(",")[0])
then making a copy of lines (newLines = lines)
and going through dict's keys, seeing if that person is already in the csv, if so, append the value followed by a comma
But I'm stuck at the part of adding score of 0
Any help or contributions would be appreciated
EXAMPLE: Before, I have this:
-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
And I have this dictionary to add
{'Mark': 1750, 'Hannah':1640, 'Brian':1780}
The result should be
-,day1,day2,day3,day4
Mark,1500,0,1660,1750
John,1800,1640,0,0
Peter,1670,1680,1630,0
Hannah,1480,1520,1570,1640
Brian,0,0,0,1780
See how Brian is in the dict but not in the "before" CSV, and he got added with a score of 0 for every other day. I figured out that one line's .split(',') gives a list of N elements, where N - 2 is the number of zero scores to add prior to that person's first day.
This is easy to do in pandas as an outer join. Read the CSV into a dataframe and generate a new dataframe from the dictionary. The join is almost what you want, except that not-a-number values are inserted for empty cells, so you need to fill the NaNs with zero and reconvert everything to integer.
The one potential problem is that the resulting CSV is sorted; the new rows are not simply appended to the bottom.
import errno
import os

import pandas as pd

INDEX_COL = "-"

def add_days_score(filename, colname, scores):
    try:
        df = pd.read_csv(filename, index_col=INDEX_COL)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # file doesn't exist, create empty df
            df = pd.DataFrame([], columns=[INDEX_COL])
            df = df.set_index(INDEX_COL)
        else:
            raise
    new_df = pd.DataFrame.from_dict({colname: scores})
    merged = df.join(new_df, how="outer").fillna(0).astype(int)
    try:
        merged.to_csv(filename + ".tmp", index_label=[INDEX_COL])
    except:
        raise
    else:
        os.rename(filename + ".tmp", filename)
    return merged

#============================================================================
# TEST
#============================================================================

test_file = "this_is_a_test.csv"

before = """-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
"""

after = """-,day1,day2,day3,day4
Brian,0,0,0,1780
Hannah,1480,1520,1570,1640
John,1800,1640,0,0
Mark,1500,0,1660,1750
Peter,1670,1680,1630,0
"""

test_dicts = [
    ["day4", {'Mark': 1750, 'Hannah': 1640, 'Brian': 1780}],
]

open(test_file, "w").write(before)
for name, scores in test_dicts:
    add_days_score(test_file, name, scores)
print("want\n", after, "\n")
got = open(test_file).read()
print("got\n", got, "\n")
if got != after:
    print("FAILED")

How to make Slices from a Dataframe where Column Equals a Value

I have two sets of CSV data. One contains two columns (time and a boolean flag), and the other contains some info I'd like to display visually with some graphing functions. The data is sampled at different frequencies, so the number of rows may not match between the datasets. How do I plot individual graphs for each range of data where the boolean is true?
Here is what the contact data looks like:
INDEX | TIME | CONTACT
0 | 240:18:59:31.750 | 0
1 | 240:18:59:32.000 | 0
2 | 240:18:59:32.250 | 0
........
1421 | 240:19:05:27.000 | 1
1422 | 240:19:05:27.250 | 1
The other (Vehicle) data isn't really important, but it contains values like Weight, Speed (MPH), Pedal Position, etc.
I have many separate large Excel files, and because the shapes do not match I am unsure how to slice the data using the time flags, so I made the function below to create the ranges. I am thinking this can be done in an easier manner.
Here is the working code (with output below). In short, is there an easier way to do this?
def determineContactSlices(data):
    contactStart = None
    contactEnd = None
    slices = pd.DataFrame([])
    for index, row in data.iterrows():
        if row['CONTACT'] == 1:
            # begin slice
            if contactStart is None:
                contactStart = index
                continue
            else:
                # still valid, move onto next
                continue
        elif row['CONTACT'] == 0:
            if contactStart is not None:
                contactEnd = index - 1
                # create slice and add the df to the result
                slice = data[contactStart:contactEnd]
                print(slice)
                slices = slices.append(slice)
                # then reset everything
                slice = None
                contactStart = None
                contactEnd = None
            continue
        else:
            # move onto next row
            continue
    return slices
Output: ([15542 rows x 2 columns])
Index Time CONTACT
1421 240:19:05:27.000 1
1422 240:19:05:27.250 1
1423 240:19:05:27.500 1
1424 240:19:05:27.750 1
1425 240:19:05:28.000 1
1426 240:19:05:28.250 1
... ...
56815 240:22:56:15.500 1
56816 240:22:56:15.750 1
56817 240:22:56:16.000 1
56818 240:22:56:16.250 1
56819 240:22:56:16.500 1
With this output I intend to loop through each time slice and display the Vehicle Data in subplots.
Any help or guidance would be much appreciated (:
UPDATE:
I believe I can just do filteredData = vehicleData[contactData['CONTACT'] == 1], but then I am faced with how to go about graphing individually when there is a disconnect. For example, if there are 7 connections at various times and lengths, I would like to have 7 individual plots.
I think what you are trying to do is relatively simple, although I am not sure if I understand the output that you want or what you want to do with it after you have it. For example:
contact_df = data[data['CONTACT'] == 1]
non_contact_df = data[data['CONTACT'] == 0]
If this isn't helpful, please provide some additional details as to what the output should look like and what you plan to do with it after it is created.
Old question but why not:
sliceStart_index = df[df["date"] == "2012-12-28"].index.tolist()[0]
sliceEnd_index = df[df["date"] == "2013-01-10"].index.tolist()[0]
this_is_your_slice = df.iloc[sliceStart_index:sliceEnd_index]
The first two lines actually get you a list of indexes where the condition is met; I just chose the first ones as an example.
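Coming back to the update about plotting each contact period separately: one common pattern (only a sketch, not taken from the answers above, assuming data is the contact DataFrame with a 0/1 CONTACT column) is to label each run of consecutive values with diff/cumsum and group on that label:

import pandas as pd

# Label each run of consecutive identical CONTACT values, then keep only
# the runs where CONTACT == 1; each group is one contiguous contact segment.
run_id = (data['CONTACT'].diff().fillna(0) != 0).cumsum()
segments = [seg for _, seg in data[data['CONTACT'] == 1].groupby(run_id)]

# 'segments' is a list of DataFrames, one per contact period; loop over it
# and draw one plot per segment, e.g.:
for seg in segments:
    print(seg.index.min(), seg.index.max(), len(seg))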

Dictionary of dataframes not saving all

I've been trying to create a dictionary of dataframes so I can store data coming from different files. I have one dataframe created in the following loop, and I would like to aggregate them so that each dataframe ends up in the dictionary. I will have to join them later on the date.
d = {}
for num in range(3, 14):
    nodeName = "rgs" + str(num).zfill(2)  # the key should be the nodeName
    # Bunch of stuff to get the data ...
    # Fill dataframe
    data = {'date': date_list, 'users': users_list}
    df = pd.DataFrame(data)
    df = df.convert_objects(convert_numeric=True)
    df = df.dropna(subset=['users'])
    df['users'] = df['users'].astype(int)
    d = {nodeName: df}
print(d)
The problem that I have is, if I print the dictionary out of the loop I only have one item, the last one.
{'rgs13': date users
0 2016-01-18 1
1 2016-01-19 1
2 2016-01-20 1
3 2016-01-21 1
4 2016-01-22 1
5 2016-01-23 1
6 2016-01-24 0
But I can clearly see that I can generate all the dataframes without problems inside the loop. How can I make the dictionary to keep all the df's? What am I doing wrong?
Thanks for the help.
It's because in the end you are re-defining d.
What you want is this:
d = {}
for num in range(3, 14):
    nodeName = "rgs" + str(num).zfill(2)  # the key should be the nodeName
    # Bunch of stuff to get the data ...
    # Fill dataframe
    data = {'date': date_list, 'users': users_list}
    df = pd.DataFrame(data)
    df = df.convert_objects(convert_numeric=True)
    df = df.dropna(subset=['users'])
    df['users'] = df['users'].astype(int)
    d[nodeName] = df
print(d)
Instead of d = {nodeName:df} use
d[nodeName] = df
Since this adds a key/value pair to d whereas d = {nodeName:df} reassigns d to a new dict (with only the one key/value pair). Doing that in a loop spells death to all the previous key/value pairs.
You may find Ned Batchelder's Facts and myths about Python names and values a useful read. It will give you the right mental model for thinking about the relationship between variable names and values, and help you see what statements modify values (e.g. d[nodeName] = df) versus reassign variable names (e.g. d = {nodeName:df}).
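As a side note on the "join them later by the date" part of the question: once the dictionary is filled correctly, the frames can be combined in one call. This is only a sketch, assuming every frame keeps a 'date' column as in the loop above:

import pandas as pd

# Put 'date' on the index of each frame, then line the frames up side by side.
# The result has a MultiIndex on the columns: (nodeName, 'users').
combined = pd.concat(
    {name: frame.set_index('date') for name, frame in d.items()},
    axis=1,
)
print(combined.head())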

Pandas joining based on date

I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe whose date is just before that of the left dataframe. It's probably easiest to show with an example.
df1:
group  date     teacher
a      1/10/00  1
a      2/27/00  1
b      1/7/00   1
b      4/5/00   1
c      2/9/00   2
c      9/12/00  2
df2:
teacher  date     hair length
1        1/1/00   4
1        1/5/00   8
1        1/30/00  20
1        3/20/00  100
2        1/1/00   0
2        8/10/00  50
Gives us:
group  date     teacher  hair length
a      1/10/00  1        8
a      2/27/00  1        20
b      1/7/00   1        8
b      4/5/00   1        100
c      2/9/00   2        0
c      9/12/00  2        50
Edit 1:
Hacked together a way to do this. Basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow, surely there must be a better way.
One way to do this is to create a new column in the left data frame, which will (for a given row's date) determine the value that is closest and earlier:
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.
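For completeness, the merge step that follows could look something like this (only a sketch, continuing from the join_date column computed above and assuming the date columns are already parsed so the comparison is meaningful):

# Rename the right-hand date so both frames share the 'join_date' key,
# then merge on teacher and join_date.
merged = df1.merge(
    df2.rename(columns={'date': 'join_date'}),
    on=['teacher', 'join_date'],
    how='left',
)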
This is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right hand data frame just until the date is larger:
# Assuming df1 and df2 are sorted by the dates
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
for i, l_row in df1.iterrows():
    cur_hair_length = 0  # assume 0 works when df1 has a date earlier than df2
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
Seems like the quickest way to do this is using sqlite via pysqldf:
def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError(
            'Need to pass in both a group and date key for both tables') from e

    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
        FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                     MAX(tableb.{date_b}) AS tdate
              FROM tablea
              JOIN tableb
                ON tablea.{group_a}=tableb.{group_b}
               AND tablea.{date_a}>=tableb.{date_b}
              GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
             ) AS a
        JOIN tableb b
          ON a.{group_a}=b.{group_b}
         AND a.tdate=b.{date_b};
        """.format(group_a=tablea_group, date_a=tablea_date,
                   group_b=tableb_group, date_b=tableb_date,
                   temp_date='join_date', base_id=base_id)

    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())
    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])
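Incidentally, pandas now ships a built-in asof join that does this backward-looking match directly. A minimal sketch, assuming the date columns can be parsed with pd.to_datetime and that df1/df2 match the example above:

import pandas as pd

df1['date'] = pd.to_datetime(df1['date'], format='%m/%d/%y')
df2['date'] = pd.to_datetime(df2['date'], format='%m/%d/%y')

# merge_asof needs both frames sorted by the 'on' key; 'by' matches teachers
# first, then takes the last df2 row whose date is <= the df1 date.
result = pd.merge_asof(
    df1.sort_values('date'),
    df2.sort_values('date'),
    on='date',
    by='teacher',
    direction='backward',
)
print(result.sort_values(['group', 'date']))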
