I have a csv file that is tab delimited.
Example:
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts State ES DPt Time
1 0 1 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 3.41214609 R 0 09:44:13
2 0 1 30.00000000 30.00000000 0.00000000 0.00000000 0.00000000 3.41077280 R 1 09:44:43
3 0 1 60.00000000 60.00000000 0.00000000 0.00000000 0.00000000 3.41077280 R 1 09:45:13
I read the csv in using:
import pandas as pd
df = pd.read_csv('foo.csv', sep='\t')
This gives the output:
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts State ES DPt Time
1 0 1 0.00 0.00 0.000000 0.000000 0.000000 3.412146 R 0 09:44:13 NaN
2 0 1 30.00 30.00 0.000000 0.000000 0.000000 3.410773 R 1 09:44:43 NaN
3 0 1 60.00 60.00 0.000000 0.000000 0.000000 3.410773 R 1 09:45:13 NaN
This seems to have shifted my column names over by one and causes my last column to be filled with NaNs instead of the time values.
If I do the following:
import pandas as pd
df = pd.read_csv("foo.csv", sep="\t")
df = pd.read_csv("foo.csv", sep="\t", usecols=df[:len(df.columns)])
I get the following output:
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts State ES DPt Time
1 1 0 1 0.00 0.00 0.000000 0.000000 0.000000 3.412146 R 0 09:44:13
2 2 0 1 30.00 30.00 0.000000 0.000000 0.000000 3.410773 R 1 09:44:43
3 3 0 1 60.00 60.00 0.000000 0.000000 0.000000 3.410773 R 1 09:45:13
Also, if I try to grab just two specific columns it seems to grab them correctly. For example, df = pd.read_csv("foo.csv", sep="\t", usecols=[3, 8]) will correctly grab the Test (Sec) column and the Volts column.
I was hoping there was a way to correctly frame the data that wouldn't require me reading it twice.
Thanks in advance!
Oniwa
It looks like there are some trailing tabs:
>>> with open("oniwa.dat") as fp:
...     for line in fp:
...         print(repr(line))
...
'Rec#\tCyc#\tStep\tTest (Sec)\tStep (Sec)\tAmp-hr\tWatt-hr\tAmps\tVolts\tState\tES\tDPt Time\n'
'1\t0\t1\t0.00000000\t0.00000000\t0.00000000\t0.00000000\t0.00000000\t3.41214609\tR\t0\t09:44:13\t\n'
'2\t0\t1\t30.00000000\t30.00000000\t0.00000000\t0.00000000\t0.00000000\t3.41077280\tR\t1\t09:44:43\t\n'
'3\t0\t1\t60.00000000\t60.00000000\t0.00000000\t0.00000000\t0.00000000\t3.41077280\tR\t1\t09:45:13\n'
As a result, pandas concludes there's an index column. We can tell it otherwise using index_col. To be specific, instead of
>>> pd.read_csv("oniwa.dat", sep="\t") # no good
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts \
1 0 1 0 0 0 0 0 3.412146 R
2 0 1 30 30 0 0 0 3.410773 R
3 0 1 60 60 0 0 0 3.410773 R
State ES DPt Time
1 0 09:44:13 NaN
2 1 09:44:43 NaN
3 1 09:45:13 NaN
we can use
>>> pd.read_csv("oniwa.dat", sep="\t", index_col=False) # hooray!
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts \
0 1 0 1 0 0 0 0 0 3.412146
1 2 0 1 30 30 0 0 0 3.410773
2 3 0 1 60 60 0 0 0 3.410773
State ES DPt Time
0 R 0 09:44:13
1 R 1 09:44:43
2 R 1 09:45:13
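As a quick sanity check (a sketch against the same oniwa.dat), the columns now line up with the header and the DPt Time column keeps its values:
>>> df = pd.read_csv("oniwa.dat", sep="\t", index_col=False)
>>> list(df.columns)
['Rec#', 'Cyc#', 'Step', 'Test (Sec)', 'Step (Sec)', 'Amp-hr', 'Watt-hr', 'Amps', 'Volts', 'State', 'ES', 'DPt Time']
>>> df["DPt Time"]
0    09:44:13
1    09:44:43
2    09:45:13
Name: DPt Time, dtype: object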
Related
I have a dataframe as follows:
df = pd.DataFrame({'Key':[1,1,1,1,2,2,2,4,4,4,5,5],
'Activity':['A','A','H','B','B','H','H','A','C','H','H','B'],
'Date':['2022-12-03','2022-12-04','2022-12-06','2022-12-08','2022-12-03','2022-12-06','2022-12-10','2022-12-03','2022-12-04','2022-12-07','2022-12-03','2022-12-13']})
I need to count the activities for each 'Key' that occur before 'Activity' == 'H' as follows:
Required output:
Key  A  B  C
1    2  0  0
2    0  1  0
4    1  0  1
5    0  0  0
My approach:
1. Sort df by Key & Date (the sample input is already sorted).
2. Drop the rows that occur after the 'H' activity in each group.
3. Group by and count: df.groupby(['Key', 'Activity']).count()
Is there a better approach? If not, please help me with the code for dropping the rows that occur after the 'H' activity in each group.
Thanks in advance!
You can bring the H dates "back" into each previous row to use in a comparison.
First, convert Date to datetime (so the dates compare as dates rather than strings) and mark each H date in a new column:
df["Date"] = pd.to_datetime(df["Date"])
df.loc[df["Activity"] == "H", "End"] = df["Date"]
Key Activity Date End
0 1 A 2022-12-03 NaT
1 1 A 2022-12-04 NaT
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 NaT
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 NaT
8 4 C 2022-12-04 NaT
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
Backward fill the new column for each group:
df["End"] = df.groupby("Key")["End"].bfill()
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 2022-12-06
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
You can then select the rows where Date is before End:
df.loc[df["Date"] < df["End"]]
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
4 2 B 2022-12-03 2022-12-06
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
To generate the final form, you can use .pivot_table():
(df.loc[df["Date"] < df["End"]]
.pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
.reindex(df["Key"].unique()) # Add in keys with no match e.g. `5`
.fillna(0)
.astype(int))
Activity A B C
Key
1 2 0 0
2 0 1 0
4 1 0 1
5 0 0 0
Try this: keep only the rows before the first 'H' in each Key (via a cumulative sum of the 'H' flag), then one-hot encode Activity, sum per Key, and reindex so keys with no surviving rows still appear:
(df.loc[df['Activity'].eq('H').groupby(df['Key']).cumsum().eq(0)]
.set_index('Key')['Activity']
.str.get_dummies()
.groupby(level=0).sum()
.reindex(df['Key'].unique(),fill_value=0)
.reset_index())
Output:
Key A B C
0 1 2 0 0
1 2 0 1 0
2 4 1 0 1
3 5 0 0 0
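To see why the filter works, here is the intermediate mask on the sample data (a minimal sketch, rebuilding the df from the question): the cumulative sum of the 'H' indicator is still 0 only for rows that come before the first 'H' in their Key.
import pandas as pd

df = pd.DataFrame({'Key': [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
                   'Activity': ['A', 'A', 'H', 'B', 'B', 'H', 'H', 'A', 'C', 'H', 'H', 'B'],
                   'Date': ['2022-12-03', '2022-12-04', '2022-12-06', '2022-12-08',
                            '2022-12-03', '2022-12-06', '2022-12-10', '2022-12-03',
                            '2022-12-04', '2022-12-07', '2022-12-03', '2022-12-13']})

# True only while no 'H' has appeared yet within the Key
mask = df['Activity'].eq('H').groupby(df['Key']).cumsum().eq(0)
print(df[mask])
#    Key Activity        Date
# 0    1        A  2022-12-03
# 1    1        A  2022-12-04
# 4    2        B  2022-12-03
# 7    4        A  2022-12-03
# 8    4        C  2022-12-04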
You can try:
# sort by Key and Date
df.sort_values(['Key', 'Date'], inplace=True)
# this is to keep Key in the result when no values are kept after the filter
df.Key = df.Key.astype('category')
# filter all rows after the 1st H for each Key and then pivot
df[~df.Activity.eq('H').groupby(df.Key).cummax()].pivot_table(
    index='Key', columns='Activity', aggfunc='size'
).reset_index()
#Activity Key A B C
#0 1 2 0 0
#1 2 0 1 0
#2 4 1 0 1
#3 5 0 0 0
I have an existing dataframe which looks like:
id start_date end_date
0 1 20170601 20210531
1 2 20181001 20220930
2 3 20150101 20190228
3 4 20171101 20211031
I am trying to add 85 columns to this dataframe, one for each month/year between 20120101 and 20190101: a column is 1 if that month falls within the row's start_date to end_date range, else 0.
I tried the following method:
from collections import OrderedDict
from datetime import datetime, timedelta
import pandas as pd

start, end = [datetime.strptime(_, "%Y%m%d") for _ in ['20120101', '20190201']]
global_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())

def get_count(contract_start_date, contract_end_date):
    start, end = [datetime.strptime(_, "%Y%m%d") for _ in [contract_start_date, contract_end_date]]
    current_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())
    temp_list = []
    for each in global_list:
        if each in current_list:
            temp_list.append(1)
        else:
            temp_list.append(0)
    return pd.Series(temp_list)
sample_df[global_list] = sample_df[['contract_start_date', 'contract_end_date']].apply(lambda x: get_count(*x), axis=1)
and the sample df looks like:
customer_id contract_start_date contract_end_date 01/12 02/12 03/12 04/12 05/12 06/12 07/12 ... 04/18 05/18 06/18 07/18 08/18 09/18 10/18 11/18 12/18 01/19
1 1 20181001 20220930 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 1 1
9 2 20160701 20200731 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
3 3 20171101 20211031 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
3 rows × 88 columns
It works fine for a small dataset, but for 160k rows it hadn't finished even after 3 hours. Can someone tell me a better way to do this?
I'm also facing problems when the dates overlap for the same customer.
First I'd cut off the dud dates, to normalize end_date so it stays within the time range (this assumes start_date and end_date have already been parsed as datetimes):
In [11]: df.end_date = df.end_date.where(df.end_date < '2019-02-01', pd.Timestamp('2019-01-31')) + pd.offsets.MonthBegin()
In [12]: df
Out[12]:
id start_date end_date
0 1 2017-06-01 2019-02-01
1 2 2018-10-01 2019-02-01
2 3 2015-01-01 2019-02-01
3 4 2017-11-01 2019-02-01
Note: you'll need to do the same trick for start_date if there are dates prior to 2012.
I'd create the resulting DataFrame from a date range for the columns and then fill it in: a 1 at each row's start month and a -1 at its end month, forward filling in between (np below is numpy, i.e. import numpy as np):
In [13]: m = pd.date_range('2012-01-01', '2019-02-01', freq='MS')
In [14]: res = pd.DataFrame(0., columns=m, index=df.index)
In [15]: res.update(pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.start_date).groupby(axis=1, level=0).sum())
In [16]: res.update(-pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.end_date).groupby(axis=1, level=0).sum())
The groupby sum is required if multiple rows start or end in the same month.
# -1 and NaN were really placeholders for zero
In [17]: res = res.replace(0, np.nan).ffill(axis=1).replace([np.nan, -1], 0)
In [18]: res
Out[18]:
2012-01-01 2012-02-01 2012-03-01 2012-04-01 2012-05-01 ... 2018-09-01 2018-10-01 2018-11-01 2018-12-01 2019-01-01
0 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 1.0 1.0 1.0
2 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
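As an aside, a different sketch (not this answer's approach) that avoids the row-wise apply entirely is a broadcast comparison of each month start against the parsed contract dates; this assumes the 85 columns should be labelled '01/12' through '01/19' as in the question's sample output:
import pandas as pd

# the question's dataframe, with integer-like date columns
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'start_date': [20170601, 20181001, 20150101, 20171101],
                   'end_date': [20210531, 20220930, 20190228, 20211031]})

months = pd.date_range('2012-01-01', '2019-01-01', freq='MS')          # 85 month starts
start = pd.to_datetime(df['start_date'].astype(str), format='%Y%m%d')
end = pd.to_datetime(df['end_date'].astype(str), format='%Y%m%d')

# broadcast: one boolean per (row, month), True when the month start lies in [start, end]
inside = (months.values >= start.values[:, None]) & (months.values <= end.values[:, None])
flags = pd.DataFrame(inside.astype(int), index=df.index, columns=months.strftime('%m/%y'))
out = pd.concat([df, flags], axis=1)   # original columns plus the 85 flag columns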
I have a couple of questions.
First, I have a datetime column in a pandas dataframe whose values look like this:
2018/03/06 00:01:27:744
How can I convert this column to a proper datetime?
Second, given data like this:
Time Sensor1 Sensor2 TimeCumsum
2018/03/06 00:01:27:744 0 1
2018/03/06 00:01:27:759 0 1
2018/03/06 00:01:27:806 0 1 0.15
2018/03/06 00:01:27:838 1 1
2018/03/06 00:01:28:009 1 1 0.2
2018/03/06 00:01:28:056 1 0 ...
I want the cumulative sum of the elapsed seconds while Sensor1 is 0 and Sensor2 is 1.
How can I do this?
Thanks.
I believe you need:
df['Time'] = pd.to_datetime(df['Time'], format='%Y/%m/%d %H:%M:%S:%f')
m = (df['Sensor1'].eq(0) & df['Sensor2'].eq(1))
df['col'] = df.loc[m, 'Time'].dt.microsecond.cumsum() // 10**3
print (df)
Time Sensor1 Sensor2 col
0 2018-03-06 00:01:27.744 0 1 744.0
1 2018-03-06 00:01:27.759 0 1 1503.0
2 2018-03-06 00:01:27.806 0 1 2309.0
3 2018-03-06 00:01:27.838 1 1 NaN
4 2018-03-06 00:01:28.009 1 1 NaN
5 2018-03-06 00:01:28.056 1 0 NaN
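If the goal is the cumulative elapsed time in seconds while Sensor1 is 0 and Sensor2 is 1 (rather than a cumsum of the microsecond components), a hedged alternative sketch is to accumulate the row-to-row time differences only where the condition holds; the expected numbers in the question aren't fully specified, so treat this as one interpretation:
import pandas as pd

df = pd.DataFrame({'Time': ['2018/03/06 00:01:27:744', '2018/03/06 00:01:27:759',
                            '2018/03/06 00:01:27:806', '2018/03/06 00:01:27:838',
                            '2018/03/06 00:01:28:009', '2018/03/06 00:01:28:056'],
                   'Sensor1': [0, 0, 0, 1, 1, 1],
                   'Sensor2': [1, 1, 1, 1, 1, 0]})

df['Time'] = pd.to_datetime(df['Time'], format='%Y/%m/%d %H:%M:%S:%f')
elapsed = df['Time'].diff().dt.total_seconds().fillna(0)   # seconds since the previous row
m = df['Sensor1'].eq(0) & df['Sensor2'].eq(1)
df['TimeCumsum'] = elapsed.where(m, 0).cumsum()            # accumulate only while the condition holds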
I'm working with data that look something like this:
ID PATH GROUP
11937 MM-YT-UJ-OO GT
11938 YT-RY-LM TQ
11939 XX-XX-OT DX
I'd like to tokenize the PATH column into n-grams and then one-hot encode those into their own columns so I'd end up with something like:
ID GROUP MM YT UJ OO RY LM XX OT MM-YT YT-UJ ...
11937 GT 1 1 1 1 0 0 0 0 1 1
I could also use counted tokens rather than one-hot, so 11939 would have a 2 in the XX column instead of a 1, but I can work with either.
I can tokenize the column quite easily with scikit-learn's CountVectorizer, but then I have to cbind the ID and GROUP fields back on. Is there a standard way to do this, or a best practice anyone has discovered?
A solution:
df.set_index(['ID', 'GROUP'], inplace=True)
pd.get_dummies(df.PATH.str.split('-', expand=True).stack())\
.groupby(level=[0,1]).sum().reset_index()
Isolate the ID and GROUP columns as the index. Then split each string into one cell per token:
df.PATH.str.split('-', expand=True)
Out[37]:
0 1 2 3
ID GROUP
11937 GT MM YT UJ OO
11938 TQ YT RY LM None
11939 DX XX XX OT None
Get them into a single column of data
df.PATH.str.split('-', expand=True).stack()
Out[38]:
ID GROUP
11937 GT 0 MM
1 YT
2 UJ
3 OO
11938 TQ 0 YT
1 RY
2 LM
11939 DX 0 XX
1 XX
2 OT
get_dummies brings the counts in as columns, spread across rows:
pd.get_dummies(df.PATH.str.split('-', expand=True).stack())
Out[39]:
LM MM OO OT RY UJ XX YT
ID GROUP
11937 GT 0 0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0 1
2 0 0 0 0 0 1 0 0
3 0 0 1 0 0 0 0 0
11938 TQ 0 0 0 0 0 0 0 0 1
1 0 0 0 0 1 0 0 0
2 1 0 0 0 0 0 0 0
11939 DX 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 1 0
2 0 0 0 1 0 0 0 0
Group the data per ID, GROUP (levels 0 and 1 in the index) to sum the rows together and have one line per tuple. Finally, reset the index to get ID and GROUP back as regular columns.
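Running the full chain from the top of the answer then gives one row per ID/GROUP pair (a sketch of the expected result; the column order may differ):
pd.get_dummies(df.PATH.str.split('-', expand=True).stack())\
    .groupby(level=[0, 1]).sum().reset_index()
#       ID GROUP  LM  MM  OO  OT  RY  UJ  XX  YT
# 0  11937    GT   0   1   1   0   0   1   0   1
# 1  11938    TQ   1   0   0   0   1   0   0   1
# 2  11939    DX   0   0   0   1   0   0   2   0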
Maybe you can try something like this:
import pandas as pd

# Test data
df = pd.DataFrame({'GROUP': ['GT', 'TQ', 'DX'],
                   'ID': [11937, 11938, 11939],
                   'PATH': ['MM-YT-UJ-OO', 'YT-RY-LM', 'XX-XX-OT']})
# Expanding data and creating one column per token
tmp = pd.concat([df.loc[:, ['GROUP', 'ID']],
                 df['PATH'].str.split('-', expand=True)], axis=1)
# Converting wide to long format
tmp = pd.melt(tmp, id_vars=['ID', 'GROUP'])
# Now grouping and counting
tmp.groupby(['ID', 'GROUP', 'value']).count().unstack().fillna(0)
# variable
# value LM MM OO OT RY UJ XX YT
# ID GROUP
# 11937 GT 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0
# 11938 TQ 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
# 11939 DX 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0
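The question also mentions n-grams and scikit-learn's CountVectorizer; a hedged sketch of that route (assuming a reasonably recent scikit-learn, and noting that word n-grams come back space-joined, e.g. 'MM YT' rather than 'MM-YT') would be:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'ID': [11937, 11938, 11939],
                   'GROUP': ['GT', 'TQ', 'DX'],
                   'PATH': ['MM-YT-UJ-OO', 'YT-RY-LM', 'XX-XX-OT']})

# unigrams and bigrams of the '-'-separated tokens
vec = CountVectorizer(tokenizer=lambda s: s.split('-'), lowercase=False, ngram_range=(1, 2))
counts = vec.fit_transform(df['PATH'])
tokens = pd.DataFrame(counts.toarray(),
                      columns=vec.get_feature_names_out(),
                      index=df.index)
out = pd.concat([df[['ID', 'GROUP']], tokens], axis=1)   # "cbind" via the shared index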
I have -many- csv files with the same number of columns (different number of rows) in the following pattern:
File 1:
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
File 2:
*A1*,*B1*,*C1*
*A2*,*B2*,*C2*
*A3*,*B3*,*C3*
File ...
Output:
A1+*A1*+...,B1+*B1*+...,C1+*C1*+...
A2+*A2*+...,B2+*B2*+...,C2+*C2*+...
A3+*A3*+...,B3+*B3*+...,C3+*C3*+...
A4+... ,B4+... ,C4+...
For example:
File 1:
1,0,0
1,0,1
1,0,0
0,1,0
File 2:
1,1,0
1,1,1
0,1,0
Output:
2,1,0
2,1,2
1,1,0
0,1,0
I am trying to use Python/pandas and was thinking of something like this to create the reading variables:
import pandas

dic = {}
for i in range(14253, 14352):
    try:
        dic['df_{0}'.format(i)] = pandas.read_csv('output_' + str(i) + '.csv')
    except:
        pass
and then to sum the columns:
for residue in residues:
    for number in range(14254, 14255):
        df = dic['df_14253'][residue]
        df += dic['df_' + str(number)][residue]
residues is a list of strings which are the column names.
The problem is that my files have different numbers of rows, so the columns are only summed up to the last row of df1. How could I add them up through the last row of the longest file, so that no data is lost? I think pandas' groupby.sum could be an option, but I don't understand how to use it.
To add an example:
File 1:
1,0,0
1,0,1
1,0,0
0,1,0
File 2:
1,1,0
1,1,1
0,1,0
File 3:
1,0,0
0,0,1
1,0,0
1,0,0
1,0,0
1,0,1
File ...:
Output:
3,1,0
2,1,3
2,1,0
1,1,0
1,0,0
1,0,1
You can use a Panel in pandas, a 3D object that is a collection of dataframes:
dfs = {i: pd.DataFrame.from_csv('file' + str(i) + '.csv', sep=',',
                                header=None, index_col=None) for i in range(n)}  # n files
panel=pd.Panel(dfs)
dfs_sum=panel.sum(axis=0)
dfs is a dictionary of dataframes. Panel automatically pads missing values with NaN and performs the sum correctly. For example:
In [500]: panel[1]
Out[500]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
In [501]: panel[2]
Out[501]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 1 0 0
5 1 0 1
6 1 0 0
7 0 1 0
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
In [502]: panel[3]
Out[502]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 1 0 0
5 1 0 1
6 1 0 0
7 0 1 0
8 1 0 0
9 1 0 1
10 1 0 0
11 0 1 0
In [503]: panel.sum(0)
Out[503]:
0 1 2
0 3 0 0
1 3 0 3
2 3 0 0
3 0 3 0
4 2 0 0
5 2 0 2
6 2 0 0
7 0 2 0
8 1 0 0
9 1 0 1
10 1 0 0
11 0 1 0
Looking for this exact same thing, I found out that Panel is now deprecated, so I'm posting the news here:
class pandas.Panel(data=None, items=None, major_axis=None, minor_axis=None, copy=False, dtype=None)
Deprecated since version 0.20.0: The recommended way to represent 3-D data are with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. Pandas provides a to_xarray() method to automate this conversion.
to_frame(filter_observations=True)
Transform wide format into long (stacked) format as a DataFrame whose columns are the Panel's items and whose index is a MultiIndex formed of the Panel's major and minor axes.
I would recommend using
pandas.DataFrame.sum
DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Parameters:
axis : {index (0), columns (1)}
Axis for the function to be applied on.
One can use it the same way as in B.M.'s answer.
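Since Panel is gone from modern pandas, a hedged sketch of the same "pad the shorter files and sum" idea using only DataFrames could rely on DataFrame.add with fill_value=0, which aligns on the row index and treats rows missing from shorter files as zero (file names follow the pattern from the question):
from functools import reduce
import pandas as pd

# output_14253.csv ... output_14351.csv, with no header row in the files
dfs = [pd.read_csv('output_{}.csv'.format(i), header=None) for i in range(14253, 14352)]

# rows missing from a shorter file count as 0; the result keeps the length of the longest file
total = reduce(lambda a, b: a.add(b, fill_value=0), dfs)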