I have a list of dataframes (each dataframe has one timeline, always starting at 0 and ending at a different point), which I would like to save as .csv:
I want to be able to read the .csv file with its original format as a list of dataframes.
Since I could not figure out how to save a list of dataframes, I concatenated the list and saved everything as one dataframe:
pd.concat(data).to_csv(csvfile)
For reading the .csv I tried this:
df = pd.read_csv(csvfile)
This will give the locations of all zeros:
zero_indices = list(df.loc[df['Unnamed: 0'] == 0].index)
Append the number of rows so the last dataframe is included:
zero_indices.append(len(df))
Get the ranges (tuples of consecutive entries in the above list):
zero_ranges = [(zero_indices[i], zero_indices[i+1]) for i in range(len(zero_indices) - 1)]
Extract the dataframes into a list:
X_test = [df.loc[x[0]:x[1] - 1] for x in zero_ranges]
The problem I have is that the index ends up in the final list of dataframes, but what I actually want is for the column "Unnamed: 0" to be set as the index of each dataframe in the final list:
I am not entirely sure how you wanted to approach this, but this is what I understood from your problem statement. Let me know if it's what you wanted:
We have two dataframes:
>>> ee = {"Unnamed: 0" : [0,1,2,3,4,5,6,7,8], "price" : [43,43,14,6,4,2,6,4,2], "time" : [3,4,5,2,5,6,6,3,4], "hour" : [1,1,1,5,4,3,4,5,4]}
>>> one = pd.DataFrame.from_dict(ee)
>>> dd = {"Unnamed: 0" : [0,1,2,3,4,5], "price" : [23,4,32,4,3,234], "time" : [3,2,4,3,2,4], "hour" : [3,4,3,2,4,4]}
>>> two = pd.DataFrame.from_dict(dd)
Which look like this:
print(one)
   Unnamed: 0  price  time  hour
0           0     43     3     1
1           1     43     4     1
2           2     14     5     1
3           3      6     2     5
4           4      4     5     4
5           5      2     6     3
6           6      6     6     4
7           7      4     3     5
8           8      2     4     4
print(two)
   Unnamed: 0  price  time  hour
0           0     23     3     3
1           1      4     2     4
2           2     32     4     3
3           3      4     3     2
4           4      3     2     4
5           5    234     4     4
Now combine these two dataframes into a list:
list_dfs = [one,two]
print(list_dfs)
[   Unnamed: 0  price  time  hour
0            0     43     3     1
1            1     43     4     1
2            2     14     5     1
3            3      6     2     5
4            4      4     5     4
5            5      2     6     3
6            6      6     6     4
7            7      4     3     5
8            8      2     4     4,
    Unnamed: 0  price  time  hour
0            0     23     3     3
1            1      4     2     4
2            2     32     4     3
3            3      4     3     2
4            4      3     2     4
5            5    234     4     4]
Using the DataFrame method set_index():
list_dfs_index = list(map(lambda x : x.set_index("Unnamed: 0"), list_dfs))
print(list_dfs_index)
[            price  time  hour
Unnamed: 0
0              43     3     1
1              43     4     1
2              14     5     1
3               6     2     5
4               4     5     4
5               2     6     3
6               6     6     4
7               4     3     5
8               2     4     4,
             price  time  hour
Unnamed: 0
0              23     3     3
1               4     2     4
2              32     4     3
3               4     3     2
4               3     2     4
5             234     4     4]
Alternatively, you can use the same set_index function to set the index to 'Unnamed: 0' before putting the dataframes into a list.
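Coming back to the original round trip, here is a minimal sketch (assuming the CSV was written with pd.concat(data).to_csv(csvfile) as above): reading with index_col=0 restores the saved column as the index directly, and the list can be rebuilt by starting a new group wherever the index restarts at 0.
import pandas as pd

# read the saved CSV back, using the first column (the old per-frame
# index) as the index instead of leaving it as a column "Unnamed: 0"
df = pd.read_csv(csvfile, index_col=0)

# start a new group each time the index restarts at 0, then collect
# the groups into a list of dataframes
X_test = [g for _, g in df.groupby((df.index == 0).cumsum())]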
I have a dataframe that looks like
ID feature
1 2
1 3
1 4
2 3
2 2
3 5
3 8
3 4
3 2
4 4
4 6
and I want to add a new column n_ID that counts the number of times each element occurs in the column ID, so the desired output should look like
ID feature n_ID
1 2 3
1 3 3
1 4 3
2 3 2
2 2 2
3 5 4
3 8 4
3 4 4
3 2 4
4 4 2
4 6 2
I know the .value_counts() function, but I don't know how to use it to make the new column. Thanks in advance.
Using value_counts, this is what I was thinking of (thanks to @sophocles for pointing out transform):
df = pd.DataFrame({"ID": [1,1,1,2,2,3,3,3,3,4,4],
                   "feature": [1,2,3,4,5,6,7,8,9,10,11]})
df1 = df["ID"].value_counts().reset_index()
df1.columns = ["ID", "n_ID"]
df = df.merge(df1, how="left", on="ID")
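Since transform is mentioned above, here is that one-liner for comparison (a sketch): groupby(...).transform('count') broadcasts each group's size back to every row, so no merge is needed.
df["n_ID"] = df.groupby("ID")["ID"].transform("count")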
Just create a new column and count the occurrences using a lambda function:
Code:
df['n_id'] = df.apply(lambda x: df['ID'].tolist().count(x.ID), axis=1)
Output:
ID feature n_id
0 1 1 3
1 1 2 3
2 1 3 3
3 2 4 2
4 2 5 2
5 3 6 4
6 3 7 4
7 3 8 4
8 3 9 4
9 4 10 2
10 4 11 2
I'm currently working on a dataframe that has processes (identified by ID) that may or may not reach the end of the process. The end of the process is defined as the activity with index=6. What I need to do is filter the processes (ID) that are completed, meaning all 6 activities are done (so the process contains activities with index equal to 1, 2, 3, 4, 5 and 6, in this specific order).
The dataframe is structured as follows:
ID A index
1 activity1 1
1 activity2 2
1 activity3 3
1 activity4 4
1 activity5 5
1 activity6 6
2 activity7 1
2 activity8 2
2 activity9 3
3 activity10 1
3 activity11 2
3 activity12 3
3 activity13 4
3 activity14 5
3 activity15 6
And the resulting dataframe should be:
ID A index
1 activity1 1
1 activity2 2
1 activity3 3
1 activity4 4
1 activity5 5
1 activity6 6
3 activity10 1
3 activity11 2
3 activity12 3
3 activity13 4
3 activity14 5
3 activity15 6
I've tried to do so working with sum(): creating a new column 'a' and checking with gt() whether the sum of every group is greater than 20 (i.e. taking groups in which the sum is at least 21, the sum of 1+2+3+4+5+6).
df['a'] = df.groupby('ID')['index'].transform('sum')
df2 = df[df['a'].gt(20)]
Probably this isn't the best approach, so other approaches are more than welcome.
Any idea on how to select rows based on this condition?
Another possible solution:
out = (df.groupby('ID')
.filter(lambda g: (len(g['index']) == 6) and
(g['index'].eq([*range(1,7)]).all())))
print(out)
ID A index
0 1 activity1 1
1 1 activity2 2
2 1 activity3 3
3 1 activity4 4
4 1 activity5 5
5 1 activity6 6
9 3 activity10 1
10 3 activity11 2
11 3 activity12 3
12 3 activity13 4
13 3 activity14 5
14 3 activity15 6
This may not be the fastest method, especially on a large dataframe, but it does the job:
df = df.loc[df.groupby(['ID'])['index'].transform(lambda x: list(x)==list(range(1,7)))]
Or this other variation:
df = df.loc[df.groupby('ID')['index'].filter(lambda x: list(x)==list(range(1,7))).index]
Output:
ID A index
0 1 activity1 1
1 1 activity2 2
2 1 activity3 3
3 1 activity4 4
4 1 activity5 5
5 1 activity6 6
9 3 activity10 1
10 3 activity11 2
11 3 activity12 3
12 3 activity13 4
13 3 activity14 5
14 3 activity15 6
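For large frames, a vectorized variant (a sketch, not from the answers above) avoids the per-group Python lambdas: compare each row's index to its position within its ID group, then keep the groups where every row matches and the group has exactly 6 rows.
# position of each row within its ID group, counted from 1
expected = df.groupby('ID').cumcount() + 1

# a row is "in sequence" if its index equals its position; a group is
# complete if all its rows are in sequence and there are exactly 6 of them
in_seq = df['index'].eq(expected)
complete = (in_seq.groupby(df['ID']).transform('all')
            & df.groupby('ID')['index'].transform('size').eq(6))

out = df[complete]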
I have an Excel dataset which contains 100 rows and 100 columns, with order frequencies at locations described by x and y (a grid-like structure).
I'd like to convert it to the following structure with 3 columns:
x-Coördinaten | y-Coördinaten | value
The "value" column only contains positive integers. The x and y column contain float type data (geograohical coordinates.
The order does not matter, as it can easily be sorted afterwards.
So, basically, a merge of lists could work, e.g.:
[[1,5,3,5], [4,2,5,6], [2,3,1,5]] ==> [1,5,3,5,4,2,5,6,2,3,1,5]
But then I would lose the location... which is key for my project.
What is the best way to accomplish this?
Assuming this input:
l = [[1,5,3,5],[4,2,5,6],[2,3,1,5]]
df = pd.DataFrame(l)
you can use stack:
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
output:
x y value
0 0 0 1
1 0 1 5
2 0 2 3
3 0 3 5
4 1 0 4
5 1 1 2
6 1 2 5
7 1 3 6
8 2 0 2
9 2 1 3
10 2 2 1
11 2 3 5
or melt for a different order:
df2 = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
output:
x y value
0 0 0 1
1 1 0 4
2 2 0 2
3 0 1 5
4 1 1 2
5 2 1 3
6 0 2 3
7 1 2 5
8 2 2 1
9 0 3 5
10 1 3 6
11 2 3 5
You should be able to get the results with a melt operation -
df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [2, 3, 4]
df.loc[:, 'x'] = [3, 4, 5]
This is what df looks like
2 3 4 x
0 0 1 2 3
1 3 4 5 4
2 6 7 8 5
The melt operation -
df.melt(id_vars='x', var_name='y')
output -
x y value
0 3 2 0
1 4 2 3
2 5 2 6
3 3 3 1
4 4 3 4
5 5 3 7
6 3 4 2
7 4 4 5
8 5 4 8
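To keep the real coordinates rather than the positional labels used above, the same stack idea applies if the sheet is read with the x coordinates as the index and the y coordinates as the header (a sketch; the file name and sheet layout are assumptions):
import pandas as pd

# assumes the first column of the sheet holds the x coordinates and the
# header row holds the y coordinates (adjust to the actual layout)
grid = pd.read_excel("orders.xlsx", index_col=0)

df2 = (grid.rename_axis(index="x-Coördinaten", columns="y-Coördinaten")
           .stack()
           .reset_index(name="value"))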
I am trying to count consecutive elements in a dataframe and store the counts in a new column. I don't want the total number of times an element appears overall in the list, but how many times it appeared consecutively. I used this:
a=[1,1,3,3,3,5,6,3,3,0,0,0,2,2,2,0]
df = pd.DataFrame(list(zip(a)), columns =['Patch'])
df['count'] = df.groupby('Patch').Patch.transform('size')
print(df)
This gave me a result like this:
Patch count
0 1 2
1 1 2
2 3 5
3 3 5
4 3 5
5 5 1
6 6 1
7 3 5
8 3 5
9 0 4
10 0 4
11 0 4
12 2 3
13 2 3
14 2 3
15 0 4
However, I want the result to be like this:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
df = (
df.groupby((df.Patch != df.Patch.shift(1)).cumsum())
.agg({"Patch": ("first", "count")})
.reset_index(drop=True)
.droplevel(level=0, axis=1)
.rename(columns={"first": "Patch"})
)
print(df)
Prints:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
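An equivalent using named aggregation may read a little more directly (a sketch of the same run-length idea: a new group starts whenever Patch changes):
# label each run of identical consecutive values
runs = df["Patch"].ne(df["Patch"].shift()).cumsum()

out = (df.groupby(runs)
         .agg(Patch=("Patch", "first"), count=("Patch", "size"))
         .reset_index(drop=True))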
I'm trying to create a historical time series of a number of identifiers for a number of different metrics. As part of that, I'm trying to create a multi-index dataframe and then "fill it" with the individual dataframes.
Multi Index:
      ID1           ID2
      ITEM1  ITEM2  ITEM1  ITEM2
index

Dataframe to insert:
      ITEM1  ITEM2
Date
a
b
c
Looking through the official docs and this website, I found the following relevant:
Add single index data frame to multi index data frame, Pandas, Python, and the associated pandas official docs pages:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
https://pandas.pydata.org/pandas-docs/stable/advanced.html
I've managed with something like:
for i in df1.index:
for j in df2.columns:
df1.loc[i,(ID,j)]=df2.loc[i,j]
but it seems highly inefficient when I need to do this across circa 100 dataframes.
For some reason, a simple
df1.loc[i,(ID)]=df2.loc[i] doesn't seem to work,
and neither does:
df1[ID1]=df1.append(df2)
which returns "Cannot set a frame with no defined index and a value that cannot be converted to a Series".
My understanding from looking around is that this is because I'm effectively leaving half the dataframe empty (a ragged list?).
Any help on how to iteratively populate my multi-index DF would be greatly appreciated.
Let me know if I've missed relevant information,
cheers.
Setup
df1 = pd.DataFrame(
[[1, 2, 3, 4, 5, 6] * 2] * 3,
columns=pd.MultiIndex.from_product(['ID1 ID2 ID3'.split(), range(4)])
)
df2 = df1.ID1 * 2
df1
  ID1           ID2           ID3
    0  1  2  3    0  1  2  3    0  1  2  3
0   1  2  3  4    5  6  1  2    3  4  5  6
1   1  2  3  4    5  6  1  2    3  4  5  6
2   1  2  3  4    5  6  1  2    3  4  5  6
df2
   0  1  2  3
0  2  4  6  8
1  2  4  6  8
2  2  4  6  8
The problem is that Pandas is trying to line up indices (or columns in this case). We can do some transpose/join trickery but I'd rather avoid that.
Option 1
Take advantage of the fact that we can assign an array via loc so long as the shape matches up. Well, we'd better make sure it does and that the order of columns and index is correct; I use align with the 'right' parameter to do this, then assign the values of the aligned df2:
df1.loc[:, 'ID1'] = df2.align(df1.ID1, 'right')[0].values
df1
  ID1           ID2           ID3
    0  1  2  3    0  1  2  3    0  1  2  3
0   2  4  6  8    5  6  1  2    3  4  5  6
1   2  4  6  8    5  6  1  2    3  4  5  6
2   2  4  6  8    5  6  1  2    3  4  5  6
Option 2
Or, we can give df2 the additional level of column indexing that we need to line it up, then use update to replace the relevant cells in place.
df1.update(pd.concat({'ID1': df2}, axis=1))
df1
  ID1           ID2           ID3
    0  1  2  3    0  1  2  3    0  1  2  3
0   2  4  6  8    5  6  1  2    3  4  5  6
1   2  4  6  8    5  6  1  2    3  4  5  6
2   2  4  6  8    5  6  1  2    3  4  5  6
Option 3
A creative way using stack and assign, then unstack:
df1.stack().assign(ID1=df2.stack()).unstack()
  ID1           ID2           ID3
    0  1  2  3    0  1  2  3    0  1  2  3
0   2  4  6  8    5  6  1  2    3  4  5  6
1   2  4  6  8    5  6  1  2    3  4  5  6
2   2  4  6  8    5  6  1  2    3  4  5  6