how to unstack a pandas dataframe with two sets of variables - python

I have a table that looks like this. Read from a CSV file, so no levels, no fancy indices, etc.
ID  date1      amount1  date2      amount2
x   15/1/2015  100      15/1/2016  80
The actual file I have goes up to date5 and amount5.
How can I convert it to:
ID  date       amount
x   15/1/2015  100
x   15/1/2016  80
If I only had one variable, I would use pandas.melt(), but with two variables I really don't know how to do it quickly.
I could do it manually by exporting to an in-memory sqlite3 database and doing a union. Doing unions in pandas is more annoying because, unlike SQL, it requires all field names to be the same, so in pandas I'd have to create a temporary dataframe for each pair and rename its fields: one dataframe for date1 and amount1 with the fields renamed to just date and amount, then the same for all the other pairs, and only then pandas.concat them, something like the sketch below.
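A sketch of the tedious version I mean (column names taken from the example above):

import pandas as pd

pieces = []
for i in [1, 2]:  # up to 5 in the real file
    piece = df[['ID', 'date%d' % i, 'amount%d' % i]].rename(
        columns={'date%d' % i: 'date', 'amount%d' % i: 'amount'})
    pieces.append(piece)
result = pd.concat(pieces, ignore_index=True)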
Any suggestions? Thanks!

Here is one way:
>>> pandas.concat(
...     [pandas.melt(x, id_vars='ID', value_vars=x.columns[1::2].tolist(), value_name='date'),
...      pandas.melt(x, value_vars=x.columns[2::2].tolist(), value_name='amount')],
...     axis=1
... ).drop('variable', axis=1)
  ID       date  amount
0  x  15/1/2015     100
1  x  15/1/2016      80
The idea is to do two melts, one for each set of columns, then concat them. This assumes that the two kinds of columns are in alternating order, so that the columns[1::2] and columns[2::2] select them correctly. If not, you'd have to modify that part of it to choose the columns you want.
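For reference, the snippets here assume x is the frame from the question; a minimal reconstruction so they run as shown:

>>> import pandas
>>> x = pandas.DataFrame([['x', '15/1/2015', 100, '15/1/2016', 80]],
...                      columns=['ID', 'date1', 'amount1', 'date2', 'amount2'])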
You can also do it with the little-known lreshape:
>>> pandas.lreshape(x, {'date': x.columns[1::2], 'amount': x.columns[2::2]})
  ID       date  amount
0  x  15/1/2015     100
1  x  15/1/2016      80
However, lreshape is not really documented, and it's not clear whether it's meant for public use.
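If you'd rather stick to documented API, pandas.wide_to_long handles this numbered-suffix layout as well (a sketch, using the same x as above):

>>> pandas.wide_to_long(x, stubnames=['date', 'amount'], i='ID', j='seq') \
...     .reset_index(level='seq', drop=True).reset_index()

which yields the same three-column result.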

If I assume that the columns always repeat, a simple trick provides the solution you want.
The trick lies in making a list of lists of the columns that go together, then looping over the main list and appending as necessary. It does involve a call to pd.DataFrame() each time the loop runs; I am kind of pressed for time right now to find a way to avoid that. But it does work like you would expect it to, and for a small file you should not have any problems with run time.
In [1]: columns = [['date1', 'amount1'], ['date2', 'amount2'], ...]

In [2]: df_clean = pd.DataFrame(columns=['date', 'amount'])
   ...: for cols in columns:
   ...:     df_clean = df_clean.append(pd.DataFrame(df.loc[:, cols].values,
   ...:                                             columns=['date', 'amount']),
   ...:                                ignore_index=True)
   ...: df_clean
Out[2]:
        date  amount
0  15/1/2015     100
1  15/1/2016      80
The neat thing about this is that it picks up all the rows under each pair of columns in one go. So if you have 5 column pairs with 'n' rows each, the loop only runs 5 times; each pass appends all 'n' rows under that pair to the clean DataFrame, giving you a consistent result. You can then drop any NaN values and sort by date, or do whatever you want with the clean DF.
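One caveat: DataFrame.append was removed in pandas 2.0, so on recent versions the same idea is better written by collecting the pieces and concatenating once (a sketch):

pieces = [pd.DataFrame(df.loc[:, cols].values, columns=['date', 'amount'])
          for cols in columns]
df_clean = pd.concat(pieces, ignore_index=True)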
What do you think, does this beat creating an in-memory sqlite3 database?

Related

Pandas groupby user and count number of events between 2 timestamps

I have a DF1 where each row represents an "event". Each event has the columns "user" and "time":
DF1:
"user","time"
user1,2022-11-14 00:00:04
user2,2022-11-16 21:34:45
user1,2022-11-14 00:15:22
user3,2022-11-17 15:32:25
...
The "time" value is any timestamp in one week: from 2022-11-14 and 2022-11-20. There are 10k different users, and 27M events.
I have to divide the week in 8h time-slots (so 21 slots in total), and for each user, I need to look if that I can see any event of that user in each slot.
Then, I should create a DF2 (in which each row is a user) with 21 columns (one for each slot), with numbers 0 or 1: 0 if I have not seen the user in that slot, and 1 if I have seen the user in that slot.
DF2:
"user","slot1","slot2","slot3",...,"slot21"
user1,1,0,0,0,0,0,...,0
user2,0,0,1,1,1,0,...,0
user3,1,1,1,0,0,1,...,1
...
(After that, I will need to order DF2 and plot it as a sparse matrix, but that is another story...)
I have managed to fill a single row of DF2, but it takes 30 seconds for 1 user, this way:
slots = pd.date_range('2022-11-14', periods=22, freq='8h')
row = []
for i in np.arange(0, slots.value_counts().sum() - 1):
    if DF1[(DF1.user == "user1") & (DF1.time.between(slots[i], slots[i+1]))].shape[0] >= 1:
        row.append(1)
    else:
        row.append(0)
print(row)
So running this process for the 10k users would take almost 4 days...
Does anyone have an idea of how I can create DF2 in a quicker way?
Maybe something like DF1.groupby('user').time and then what else?
It can be done in pandas or any other way, or even a different language, as long as I get the sparse matrix in DF2!
Any help would be much appreciated!
Use crosstab with cut to count values; if you need 0/1 output, compare for not equal to 0 and cast to integers:
df = (pd.crosstab(df['user'],
                  pd.cut(df['time'], bins=slots, labels=False))
        .ne(0)
        .astype(int)
        .add_prefix('slot'))
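A quick toy run, reusing the slots index from the question and assuming time is already datetime64. Note the labels come out as slot0..slot20 rather than slot1..slot21, and only slots that actually contain an event show up as columns:

import pandas as pd

slots = pd.date_range('2022-11-14', periods=22, freq='8h')
DF1 = pd.DataFrame({'user': ['user1', 'user2', 'user1'],
                    'time': pd.to_datetime(['2022-11-14 00:00:04',
                                            '2022-11-16 21:34:45',
                                            '2022-11-14 00:15:22'])})
DF2 = (pd.crosstab(DF1['user'], pd.cut(DF1['time'], bins=slots, labels=False))
         .ne(0).astype(int).add_prefix('slot'))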

Apply function to several DataFrames by creating new DataFrame

Maybe someone can help me find the solution:
I have 100 dataframes. Each dataframe contains time / High_Price / Low_price.
I would like to create a new DataFrame which contains the gain from each DataFrame.
Example:
df1 = pd.DataFrame({"high": [5, 4, 5, 2],
                    "low": [1, 2, 2, 1]},
                   index=["2019-04-06", "2019-04-07", "2019-04-08", "2019-04-09"])
df100 = pd.DataFrame({"high": [7, 5, 6, 7],
                      "low": [1, 2, 3, 4]},
                     index=["2019-04-06", "2019-04-07", "2019-04-08", "2019-04-09"])
Function:
def myfunc(data, amount):
    data = data.loc[(data != 0).any(axis=1)]
    profit = (amount / data.iloc[0]['low']) * data.iloc[-1]['high']
    return profit
Output should be:
output = pd.DataFrame({"Gain": [1, 6]},
                      index=["df1", "df100"])
How can I apply the function to all 100 DataFrames and collect only the gains into a new DataFrame, where we see the name of each DataFrame and its gain?
Put your dataframes in a list and access them by integer index. Having variables named df1 to df100 is bad programming style because a) the dataframes belong together, so put them in a collection (e.g. list) and b) you cannot get "the" name of an object from its value, leading to complications such as the one you are facing now.
So let dfs be your list of 100 dataframes, starting at index 0.
Use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs], columns=['Gain'])
The index of output now corresponds to the index of dfs, starting at 0. There's no reason to rename it to 'df1' ... 'df100'; you gain no information and the output becomes harder to handle.
In case of arbitrary dataframe names, use a dictionary that maps name to df. Let's call it dfs again. Then use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs.values()],
                      columns=['Gain'], index=dfs.keys())
I'm assuming myfunc is correct, I did not debug it.
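A toy run with the two example frames from the question (amount chosen arbitrarily; the question's expected numbers don't pin it down):

dfs = {'df1': df1, 'df100': df100}
amount = 1
output = pd.DataFrame([myfunc(df, amount) for df in dfs.values()],
                      columns=['Gain'], index=dfs.keys())
# with amount=1: df1 -> (1/1)*2 = 2.0, df100 -> (1/1)*7 = 7.0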

Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from a column in df2['values'] to a column df1['values']. However values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great; however, since I do not have a lot of experience yet, I struggle to formulate more complicated code.
How can I tackle this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above also tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
         date category
0  2015-01-07       f2
1  2015-01-26       f2
2  2015-01-26       f2
3  2015-04-08       f2
4  2015-04-10       f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
  values category
0     01       f1
1     02       f1
2    2.1       f1
3    2.2       f1
4     03       f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import numpy as np
import pandas as pd

df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns, per OP's comment, we instead use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If a df1["date"] matches more than one date range, you'll have to drop some rows first; otherwise the duplicates break the row alignment described above.
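A toy run of the pipeline, assuming df2 carries explicit startdate/enddate columns per the OP's comment (names and data made up for illustration):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['2015-01-07', '2015-01-26']),
                    'category': ['f2', 'f2'],
                    'values': [np.nan, np.nan]})
df2 = pd.DataFrame({'category': ['f2'],
                    'startdate': pd.to_datetime(['2015-01-01']),
                    'enddate': pd.to_datetime(['2015-01-10']),
                    'values': [7]})

df3 = df1.merge(df2, on='category')
mask = (df3['startdate'] <= df3['date']) & (df3['date'] <= df3['enddate'])
subset = df3.loc[mask]
common_dates = df1.merge(subset, on='date', how='left')
df1['values'] = np.where(common_dates['values_y'].notna(),
                         common_dates['values_y'], df1['values'])
# df1['values'] is now [7.0, nan]: only 2015-01-07 falls inside the range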
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1[date] inside the merged dataframe that are not covered by df2[date_range]. Unfortunately, I would need more information about the content of df1[date] and df2[date_range] to write code here that does exactly that.

How can I loop through just a certain part of a csv file?

I need to loop through certain rows in my CSV file, for example, row 231 to row 252. Then I want to add up the values I get from calculating every row and divide the sum by the number of rows I looped through. How would I do that?
I'm new to pandas so I would really appreciate some help on this.
I have a CSV file from Yahoo finance looking something like this (it has many more rows):
Date,Open,High,Low,Close,Adj Close,Volume
2019-06-06,31.500000,31.990000,30.809999,31.760000,31.760000,1257700
2019-06-07,27.440001,30.000000,25.120001,29.820000,29.820000,5235700
2019-06-10,32.160000,35.099998,31.780001,32.020000,32.020000,1961500
2019-06-11,31.379999,32.820000,28.910000,29.309999,29.309999,907900
2019-06-12,29.270000,29.950001,28.900000,29.559999,29.559999,536800
I have done the basic steps of importing pandas and all that. Then I added two variables corresponding to different columns, to easily reference just those columns.
import pandas as pd
df = pd.read_csv(file_name)
high = df.High
low = df.Low
Then I tried doing something like this. I tried using .loc in a variable, but that didn't seem to work. This is maybe super dumb but I'm really new to pandas.
dates = df.loc[231:252, :]
for rows in dates:
    # calculations here, for example:
    print(high - low)
    # I would have a more complex calculation than this,
    # but for simplicity's sake let's stick with this.
The output of this is that it prints high - low for every row 1-252, for example:
...
231 3.319997
232 3.910000
233 1.050001
234 1.850001
235 0.870001
...
But I only want this output on a certain number of rows.
Then I want to add up all of those values and divide them by the number of rows I looped through. This part is simple, so you don't need to include it in your answer, but it's okay if you do.
Use skiprows and nrows. Keep the header (as in "Python Pandas read_csv skip rows but keep header") by passing a range to skiprows that starts at 1.
In [9]: pd.read_csv("t.csv", skiprows=range(1, 3), nrows=2)
Out[9]:
         Date       Open       High        Low      Close  Adj Close   Volume
0  2019-06-10  32.160000  35.099998  31.780001  32.020000  32.020000  1961500
1  2019-06-11  31.379999  32.820000  28.910000  29.309999  29.309999   907900
.loc slices by label. For integer-position slicing use .iloc (note the stop position is exclusive, so this selects rows 231 through 251):
dates = df.iloc[231:252]
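For the add-up-and-divide part, a minimal sketch (column names taken from the CSV above; summing and dividing by the row count is just the mean):

import pandas as pd

df = pd.read_csv(file_name)        # file_name as in the question
spread = df['High'] - df['Low']    # the per-row calculation
avg = spread.iloc[231:252].mean()  # average over rows 231-251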

Add multiple columns to multiple data frames

I have a number of small dataframes with a date and stock price for a given stock. Someone else showed me how to loop through them so they are contained in a list called all_dfs. So all_dfs[0] would be a dataframe with Date and IBM US equity, all_dfs[1] would be Date and MMM US equity, etc. (example shown below). The Date column in the dataframes is always the same, but the stock names are all different, and the numbers associated with each stock column are always different. So when you call all_dfs[1], this is the dataframe you would see (i.e., all_dfs[1].head()):
IDX      Date  MMM US equity
0    1/3/2000          47.19
1    1/4/2000          45.31
2    1/5/2000          46.63
3    1/6/2000          50.38
I want to add the same additional columns to EVERY dataframe. So I was trying to loop through them and add the columns. The numbers in the stock name columns are the basis for the calculations that make the other columns.
There are more columns to add, but I think they will all loop through the same way, so this is a sample of the columns I want to add:
Column 1 to add >>> df['P_CHG1D'] = df['Stock name #1'].pct_change(1) * 100
Column 2 to add >>> df['PCHG_SIG'] = df['P_CHG1D'] > 3
Column 3 to add>>> df['PCHG_SIG']= df['PCHG_SIG'].map({True:1,False:0})
This is the code that I have so far, but it is returning a syntax error for the all_dfs[i] part.
for i in range(len(df.columns)):
    for all_dfs[i]:
        df['P_CHG1D'] = df.loc[:, 0].pct_change(1) * 100
So I also have 2 problems that I cannot figure out:
I don't know how to add columns to every dataframe in the loop. I would have to do something like all_dfs[i]['ADD COLUMN NAME'] = df['Stock Name 1'].pct_change(1) * 100
The second part after the =, df['Stock Name 1'], keeps changing (in this example it is called MMM US equity, but the next time it would be the column header of the second dataframe, so it could be IBM US equity), as each dataframe has a different name, so I don't know how to reference it properly in the loop.
I am new to python/pandas, so if I am thinking about this the wrong way, let me know if there is a better solution.
Consider iterating through the length of all_dfs to reference each element in the loop by its index. For the first new column, use positional indexing with .iloc to select the stock column by its column position of 2 (third column):
for i in range(len(all_dfs)):
    all_dfs[i].is_copy = False  # turns off SettingWithCopyWarning (older pandas)
    all_dfs[i]['P_CHG1D'] = all_dfs[i].iloc[:, 2].pct_change(1) * 100
    all_dfs[i]['PCHG_SIG'] = all_dfs[i]['P_CHG1D'] > 3
    all_dfs[i]['PCHG_SIG_VAL'] = all_dfs[i]['PCHG_SIG'].map({True: 1, False: 0})
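As a side note, iterating over the list directly avoids the index bookkeeping; a sketch of the same logic, still assuming the stock price sits in the third column:

for df in all_dfs:
    stock = df.columns[2]  # each frame's own stock column name
    df['P_CHG1D'] = df[stock].pct_change(1) * 100
    df['PCHG_SIG_VAL'] = (df['P_CHG1D'] > 3).astype(int)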
