Appending CSV files, matching unordered columns - python

Problem: matching columns while appending CSV files
I have 50 .csv files where each column is a word, each row is a time of day and each file holds all words for one day. They look like this:
Date Time Aword Bword Cword Dword
Date1 t1 0 1 0 12
Date1 t2 0 6 3 0
Date Time Eword Fword Gword Hword Bword
Date2 t1 0 0 1 0 3
Date2 t2 2 0 0 19 0
I want to append the files so that any columns with the same word (like Bword in this example) are matched while new words are added in new columns:
Date  Time Aword Bword Cword Dword Eword Fword Gword Hword
Date1 t1       0     1     0    12
Date1 t2       0     6     3     0
Date2 t1             3                 0     0     1     0
Date2 t2             0                 2     0     0    19
I'm opening the CSV files as dataframes to manipulate them, but when I use dataframe.append the new files are simply stacked underneath like this:
Date Time Aword Bword Cword Dword
Date1 t1 0 1 0 12
Date1 t2 0 6 3 0
Date Time Eword Fword Gword Hword Bword
Date2 t1 0 0 1 0 3
Date2 t2 2 0 0 19 0
Is there a different approach which could align matching columns while appending, i.e. without iterating through each column and checking for matches?
Sincere apologies if this question is too vague; I'm new to Python and still struggling to know when I'm thinking un-pythonically and when I'm using the wrong tools.
EDIT: more information
1) I'll need to perform this task multiple times, once for each of five batches of csvs
2) The files all have 25 rows but have anything from 5 to 294 columns
3) The order of rows is important: Day1 (t1, t2...tn) then Day2 (t1, t2...tn)
4) The order of columns is not important

IIUC, you can simply use pd.concat, which will automatically align on columns:
>>> csvs = glob.glob("*.csv")
>>> dfs = [pd.read_csv(csv) for csv in csvs]
>>> df_merged = pd.concat(dfs).fillna("")
>>> df_merged
   Aword Bword Cword   Date Dword Eword Fword Gword Hword Time
0      0     1     0  Date1    12                           t1
1      0     6     3  Date1     0                           t2
0            3        Date2           0     0     1     0   t1
1            0        Date2           2     0     0    19   t2
(Although I'd recommend either fillna(0) or leaving it as nan; if you fill with an empty string to look like your desired output, the column has to have object dtype and those are much slower than int or float.)
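For example, a minimal sketch of the integer-friendly variant, reusing the dfs list above (the astype step is just one way to get back to int after fillna turns the padded columns into floats):
>>> df_merged = pd.concat(dfs, ignore_index=True).fillna(0)
>>> word_cols = [c for c in df_merged.columns if c not in ("Date", "Time")]
>>> df_merged[word_cols] = df_merged[word_cols].astype(int)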
If you're really particular about the column order, you could cheat and use (re)set_index:
>>> df_merged.set_index(["Date", "Time"]).reset_index()
    Date Time Aword Bword Cword Dword Eword Fword Gword Hword
0  Date1   t1     0     1     0    12
1  Date1   t2     0     6     3     0
2  Date2   t1           3                 0     0     1     0
3  Date2   t2           0                 2     0     0    19

I think for this kind of thing you might find using the pandas library a bit easier. Say filelist is a list of file names.
import pandas as pd
df = pd.concat([pd.read_csv(fl, index_col=[0,1]) for fl in filelist])
And you're done! As a side note, if you'd like to combine the Date and Time columns into a single datetime column (depending on their format), you can try
df = pd.concat([pd.read_csv(fl, parse_dates=[['Date', 'Time']]) for fl in filelist])
which produces a single Date_Time column.
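Since your edit says the row order matters (Day1 rows before Day2 rows), a minimal sketch, assuming the Date and Time strings sort chronologically, is to sort the combined frame on its (Date, Time) index from the first snippet:
df = pd.concat([pd.read_csv(fl, index_col=[0, 1]) for fl in filelist]).sort_index()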

If the order of rows and columns is not important (if it is, you need to edit your Q to specify how to deal with it when the order differs among files!), there are no conflicts (different values of the same column for the same date and time), the data fit in memory, and you prefer to work in plain Python rather than in Pandas (I notice you haven't tagged your Q with pandas), one approach might be the following:
import collections
import csv
def merge_csvs(*filenames):
    result_dict = collections.defaultdict(dict)
    all_columns = set()
    for fn in filenames:
        with open(fn) as f:
            dr = csv.DictReader(f)
            update_cols = True
            for row in dr:
                date = row.pop('Date')
                time = row.pop('Time')
                result_dict[date, time].update(row)
                if update_cols:
                    all_columns.update(row)
                    update_cols = False
    for d in result_dict.values():
        missing_cols = all_columns.difference(d)
        d.update(dict.fromkeys(missing_cols, ''))
    return result_dict
This produces a dictionary, keyed by (date, time) pairs, of dictionaries whose keys are all the columns found in any of the input CSVs, with either the corresponding value for that date and time, or else an empty string if that column was never found for that date and time.
Now you can deal with this as you wish, e.g.
d = merge_csvs('a.csv', 'b.csv', 'c.csv')
for date, time in sorted(d):
    dd = d[date, time]
    outlist = [dd[c] for c in sorted(dd)]
    print(date, time, outlist)
or, of course, write it back to a different CSV, and so forth.
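For instance, here is a minimal sketch of writing the merged result back out with csv.DictWriter, reusing the d from above; merged.csv is a placeholder filename:
word_columns = sorted(next(iter(d.values())))  # every inner dict now has the same keys
with open('merged.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['Date', 'Time'] + word_columns)
    writer.writeheader()
    for date, time in sorted(d):
        writer.writerow(dict(d[date, time], Date=date, Time=time))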

Related

Pandas dataframe column value dependent on dynamic number of rows

I've got a dataframe that looks something like this:
user  current_date  prior_date  points_scored
1     2021-01-01    2020-10-01  5
2     2021-01-01    2020-10-01  4
2     2021-01-21    2020-10-21  4
2     2021-05-01    2021-02-01  4
The prior_date column is simply current_date minus 3 months, and points_scored is the number of points scored on current_date. I'd like to identify which rows had sum(points_scored) >= 8, where for a given user the rows considered are those whose current_date falls between that row's prior_date and current_date. It is guaranteed that no single row will have a value of points_scored >= 8.
For example, in the example above, I'd like something like this returned:
user  current_date  prior_date  points_scored  flag
1     2021-01-01    2020-10-01  5              0
2     2021-01-01    2020-10-01  4              0
2     2021-01-21    2020-10-21  4              1
2     2021-05-01    2021-02-01  4              0
The third row shows flag=1 because for row 3's values of current_date=2021-01-21 and prior_date=2020-10-21, the rows to consider would be rows 2 and 3. We consider row 2 because row 2's current_date=2021-01-01 which is between row 3's current_date and prior_date.
Ultimately, I'd like to end up with a data structure where it shows distinct user and flag. It could be a dataframe or a dictionary-- anything easily referencable.
user  flag
1     0
2     1
To do this, I'm doing something like this:
flags = {}
ids = list(df['user'].value_counts()[df['user'].value_counts() > 2].index)
for id in ids:
    temp_df = df[df['user'] == id]
    for idx, row in temp_df.iterrows():
        cur_date = row['current_date']
        prior_date = row['prior_date']
        temp_total = temp_df[(temp_df['current_date'] <= cur_date) & (temp_df['current_date'] >= prior_date)]['points_scored'].sum()
        if temp_total >= 8:
            flags[id] = 1
            break
The code above works, but just takes way too long to actually execute.
You are right, performing loops on large data can be quite time-consuming. This is where the power of NumPy comes into full play. I am still not sure exactly what you want, but I can help address the speed.
np.select can perform your if/else logic efficiently.
import pandas as pd
import numpy as np

condition = [df['points_scored'] == 5, df['points_scored'] == 4, df['points_scored'] == 3]  # <-- put your conditions here
choices = ['okay', 'hmmm!', 'yes']  # <-- what you want returned (the order is important)
np.select(condition, choices, default='default value')
Also, you might want to be more succinct about what you want. Meanwhile, you can refactor your loops with np.select().
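If what you ultimately need is the per-user flag itself, here is a rough, hypothetical sketch of a loop-free version (it is not taken from the answer above; it assumes the column names from the question and converts the dates first). It pairs each row with every row of the same user via a self-merge and sums points inside the date window:
import pandas as pd

df['current_date'] = pd.to_datetime(df['current_date'])
df['prior_date'] = pd.to_datetime(df['prior_date'])

# pair each row with all rows of the same user, keep only pairs inside the window
pairs = df.merge(df, on='user', suffixes=('', '_other'))
in_window = pairs['current_date_other'].between(pairs['prior_date'], pairs['current_date'])
window_sum = pairs[in_window].groupby(['user', 'current_date'])['points_scored_other'].sum()

# one flag per user: 1 if any window sum reaches 8
flags = (window_sum >= 8).groupby('user').max().astype(int)
The self-merge grows with the square of each user's row count, so this trades memory for speed.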

Efficient way to add many rows to a DataFrame

I really want to speed up my code.
My already working code loops through a DataFrame and gets the start and end year. Then I add them to the lists. At the end of the loop, I build the DataFrame from the lists.
rows = range(3560)

# initiate lists and dataframe
start_year = []
end_year = []

for i in rows:
    start_year.append(i)
    end_year.append(i)

df = pd.DataFrame({'Start date': start_year, 'End date': end_year})
I get what I expect, but very slowly:
Start date End date
0 1 1
1 2 2
2 3 3
3 4 4
Yes, it can be made faster. The trick is to avoid list.append (or, worse pd.DataFrame.append) in a loop. You can use list(range(3560)), but you may find np.arange even more efficient. Here you can assign an array to multiple series via dict.fromkeys:
df = pd.DataFrame(dict.fromkeys(['Start date', 'End date'], np.arange(3560)))
print(df.shape)
# (3560, 2)
print(df.head())
# Start date End date
# 0 0 0
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
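For comparison, a minimal sketch of the plain-list variant mentioned above, without NumPy:
df = pd.DataFrame(dict.fromkeys(['Start date', 'End date'], list(range(3560))))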

Calculating date differences in Excel data

I have a scenario where I have to read an Excel file and calculate the date difference for each status and store the output in another Excel file.
date name status
1/15/2017 ABC insert_start
1/16/2017 ABC insert_complete
1/17/2017 DEF remove_start
1/18/2017 DEF remove_complete
1/19/2017 GHI create_start
1/20/2017 GHI create_complete
I need the output in the following format:
name created inserted removed
ABC 0 1 0
DEF 0 0 1
GHI 1 0 0
Where the value 1 is the date difference for ABC to go from insert_start to insert_complete.
Any help would be greatly appreciated.
Let's say df is the dataframe created by loading the excel file (which looks like the one in your example). You might have loaded it with
df = pd.read_csv('foo.csv', sep='\s+', parse_dates=['date'])
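If the data really is in an Excel workbook rather than a text export, a minimal sketch, assuming a single sheet laid out like the sample and a hypothetical foo.xlsx filename, would be:
df = pd.read_excel('foo.xlsx', parse_dates=['date'])
Either way, the steps below only need df to look like your example.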
Now, you can do this:
pivoted = df.pivot('name', 'status').fillna(0)
ops = ("create", "insert", "remove")
result = pd.concat([pivoted['date', op + '_complete']
                    - pivoted['date', op + '_start']
                    for op in ops], axis=1)
result.columns = ops
# create insert remove
#name
#ABC 0 days 1 days 0 days
#DEF 0 days 0 days 1 days
#GHI 1 days 0 days 0 days

Python Pandas dataframe

I have one dataframe (df1) like the following:
ATime ETime Difference
0 1444911017815 1588510 1444909429305
1 1444911144979 1715672 1444909429307
2 1444911285683 1856374 1444909429309
3 1444911432742 2003430 1444909429312
4 1444911677101 2247786 1444909429315
5 1444912444821 3015493 1444909429328
6 1444913394542 3965199 1444909429343
7 1444913844134 4414784 1444909429350
8 1444914948835 5519467 1444909429368
9 1444915840638 6411255 1444909429383
10 1444916566634 7137240 1444909429394
11 1444917379593 7950186 1444909429407
I have another very big dataframe (df2) which has a column named Absolute_Time. Absolute_Time has the same format as ATime of df1. So what I want to do is, for example, for all Absolute_Time values that lie in the range of row 0 to row 1 of ETime of df1, subtract row 0 of Difference of df1, and so on.
Here's an attempt to accomplish what you might be looking for, starting with:
print(df1)
ATime ETime Difference
0 1444911017815 1588510 1444909429305
1 1444911144979 1715672 1444909429307
2 1444911285683 1856374 1444909429309
3 1444911432742 2003430 1444909429312
4 1444911677101 2247786 1444909429315
5 1444912444821 3015493 1444909429328
6 1444913394542 3965199 1444909429343
7 1444913844134 4414784 1444909429350
8 1444914948835 5519467 1444909429368
9 1444915840638 6411255 1444909429383
10 1444916566634 7137240 1444909429394
11 1444917379593 7950186 1444909429407
Next, create a new DataFrame with random times within the range of df1:
from random import randrange

df2 = pd.DataFrame({'Absolute Time': [randrange(start=df1.ATime.iloc[0], stop=df1.ATime.iloc[-1]) for i in range(100)]})
df2 = df2.sort_values('Absolute Time').reset_index(drop=True)
np.searchsorted provides you with the index positions where df2 should be inserted in df1 (for the columns in question):
df2.index = np.searchsorted(df1.ATime.values, df2.loc[:, 'Absolute Time'].values)
Assigning the new index and merging produces a new DataFrame. Forward-filling the missing Difference values allows subtracting them in the next step:
df = pd.merge(df1, df2, left_index=True, right_index=True, how='left').fillna(method='ffill').dropna().astype(int)
df['Absolute Time Adjusted'] = df['Absolute Time'].sub(df.Difference)
print(df.head())
ATime ETime Difference Absolute Time \
1 1444911144979 1715672 1444909429307 1444911018916
1 1444911144979 1715672 1444909429307 1444911138087
2 1444911285683 1856374 1444909429309 1444911138087
3 1444911432742 2003430 1444909429312 1444911303233
3 1444911432742 2003430 1444909429312 1444911359690
Absolute Time Adjusted
1 1589609
1 1708780
2 1708778
3 1873921
3 1930378
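On recent pandas versions, a rough alternative sketch (not part of the answer above) is pd.merge_asof, which requires both frames to be sorted on the key:
df = pd.merge_asof(df2.sort_values('Absolute Time'),
                   df1.sort_values('ATime'),
                   left_on='Absolute Time', right_on='ATime',
                   direction='backward')
df['Absolute Time Adjusted'] = df['Absolute Time'] - df['Difference']
direction='backward' matches each Absolute Time to the nearest earlier ATime, as the question describes; direction='forward' should reproduce the searchsorted behaviour used above.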

Aggregating unbalanced panel to time series using pandas

I have an unbalanced panel that I'm trying to aggregate up to a regular, weekly time series. The panel looks as follows:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
To give a better sense of what I'm looking for, I'm including an intermediate step, which I'd love to skip if possible. Basically some data needs to be filled in so that it can be aggregated. As you can see, missing weeks in between observations are interpolated. All other values are set equal to zero.
Group Date value
A 1/1/2000 5
A 1/8/2000 5
A 1/15/2000 10
A 1/22/2000 0
B 1/1/2000 0
B 1/8/2000 3
B 1/15/2000 3
B 1/22/2000 7
C 1/1/2000 0
C 1/8/2000 0
C 1/15/2000 0
C 1/22/2000 20
The final result that I'm looking for is as follows:
Date value
1/1/2000 5 = 5 + 0 + 0
1/8/2000 8 = 5 + 3 + 0
1/15/2000 13 = 10 + 3 + 0
1/22/2000 27 = 0 + 7 + 20
I haven't gotten very far, managed to create a panel:
panel = df.set_index(['Group','week']).to_panel()
Unfortunately, if I try to resample, I get an error
panel.resample('W')
TypeError: Only valid with DatetimeIndex or PeriodIndex
Assuming df is your second dataframe (the one with weeks already filled in), you can try the following:
df.groupby('week').sum()['value']
See the pandas documentation for groupby(); it's similar to the GROUP BY clause in SQL.
To obtain the second dataframe from the first one, try the following:
Firstly, prepare a function to map the day to week
def d2w_map(day):
    if day <= 7:
        return 1
    elif day <= 14:
        return 2
    elif day <= 21:
        return 3
    else:
        return 4
In the method above, days from 29 to 31 are considered in week 4. But you get the idea. You can modify it as needed.
Secondly, take the lists out from the first dataframe, and convert days to weeks
df['Week'] = df['Day'].apply(d2w_map)
del df['Day']
Thirdly, initialize your second dataframe with only the 'Group' and 'Week' columns, leaving 'value' out. Assuming your initialized new dataframe is called result, you can now do a join
result = result.join(df, on=['Group', 'Week'])
Last, fill the NaNs in the 'value' column with the nearby element; the NaNs are what you need to interpolate. Since I am not sure how you want the interpolation to work, I will leave that to you (one possibility is sketched below).
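One possible sketch of that last step, assuming the joined frame is called result as above and that a simple per-group carry-forward (with zeros where nothing has been observed yet) is close enough to the interpolation you want:
result['value'] = result.groupby('Group')['value'].ffill().fillna(0)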
Here is how you can change d2w_map to convert string of date to integer of week
from datetime import datetime

def d2w_map(day_str):
    return datetime.strptime(day_str, '%m/%d/%Y').weekday()
Returned value of 0 means Monday, 1 means Tuesday and so on.
If you have the package dateutil installed, the function can be more robust:
from dateutil.parser import parse

def d2w_map(day_str):
    return parse(day_str).weekday()
Sometimes, things you want are already implemented by magic :)
Turns out the key is to resample a groupby object like so:
df_temp = (df.set_index('date')
             .groupby('Group')
             .resample('W', how='sum', fill_method='ffill'))
ts = (df_temp.reset_index()
             .groupby('date')
             .sum()['value'])
Used this tab delimited test.txt:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
You can skip the intermediate datafile as follows. Don't have time now. Just play around with it to get it right.
import pandas as pd
import datetime
time_format = '%m/%d/%Y'
Y = pd.read_csv('test.txt', sep="\t")
dates = Y['Date']
dates_right_format = map(lambda s: datetime.datetime.strptime(s, time_format), dates)
values = Y['value']
X = pd.DataFrame(values)
X.index = dates_right_format
print X
X = X.sort()
print X
print X.resample('W', how=sum, closed='right', label='right')
Last print
value
2000-01-02 5
2000-01-09 3
2000-01-16 NaN
2000-01-23 37
