I have a pandas.DataFrame with a large amount of data. In one column are randomly repeating keys. In another array I have a list of those keys, for which I would like to slice the matching rows from the DataFrame along with the data from the other columns in each row.
keys:
keys = numpy.array([1,5,7])
data:
indx a b c d
0 5 25.0 42.1 13
1 2 31.7 13.2 1
2 9 16.5 0.2 9
3 7 43.1 11.0 10
4 1 11.2 31.6 10
5 5 15.6 2.8 11
6 7 14.2 19.0 4
I would like to slice all rows from the DataFrame where the value in column a matches a value from keys.
Desired result:
indx a b c d
0 5 25.0 42.1 13
3 7 43.1 11.0 10
4 1 11.2 31.6 10
5 5 15.6 2.8 11
6 7 14.2 19.0 4
You can use isin:
>>> df[df.a.isin(keys)]
a b c d
indx
0 5 25.0 42.1 13
3 7 43.1 11.0 10
4 1 11.2 31.6 10
5 5 15.6 2.8 11
6 7 14.2 19.0 4
[5 rows x 4 columns]
or query:
>>> df.query("a in @keys")
a b c d
indx
0 5 25.0 42.1 13
3 7 43.1 11.0 10
4 1 11.2 31.6 10
5 5 15.6 2.8 11
6 7 14.2 19.0 4
[5 rows x 4 columns]
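For reference, here is a self-contained version of both routes, with the frame rebuilt from the table above (only the imports and the DataFrame construction are added):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [5, 2, 9, 7, 1, 5, 7],
                   'b': [25.0, 31.7, 16.5, 43.1, 11.2, 15.6, 14.2],
                   'c': [42.1, 13.2, 0.2, 11.0, 31.6, 2.8, 19.0],
                   'd': [13, 1, 9, 10, 10, 11, 4]})
df.index.name = 'indx'
keys = np.array([1, 5, 7])

print(df[df.a.isin(keys)])     # boolean-mask route
print(df.query("a in @keys"))  # query route; @ references the local variable keys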
Working with the following dataframe:
name abbreviation X Y Quantity Max Quantity
0 A-x A 15.6 19.4 1 2
1 A-y2 A 15.6 19.4 2 2
2 B-a B 15.0 25.0 1 2
3 B-d B 15.0 25.0 2 2
4 C-x1 C 15.0 10.0 1 3
5 C-c4 C 15.0 10.0 2 3
6 C-5 C 15.0 10.0 3 3
7 E-v E 83.4 16.5 1 4
8 E-2 E 83.4 16.5 2 4
9 E-v2 E 83.4 16.5 3 4
10 E-1 E 83.4 16.5 4 4
11 F-ab F 19.1 98.4 1 2
12 F-nb F 19.1 98.4 2 2
13 G-ku G 78.0 17.0 1 1
Depending on the X and Y coordinates, the quantity of rows with the same coordinate pair is counted (5th column), and the maximum quantity for each pair is in the 6th column.
Now I want to insert a new row in front of every coordinate pair, before its count starts again, containing the abbreviation as the name, the same X and Y values as the following rows, and a quantity of 0.
name abbreviation X Y Quantity Max Quantity
0 A A 15.6 19.4 0 2
1 A-x A 15.6 19.4 1 2
2 A-y2 A 15.6 19.4 2 2
3 B B 15.0 25.0 0 2
4 B-a B 15.0 25.0 1 2
5 B-d B 15.0 25.0 2 2
6 C C 15.0 10.0 0 3
7 C-x1 C 15.0 10.0 1 3
8 C-c4 C 15.0 10.0 2 3
9 C-5 C 15.0 10.0 3 3
10 E E 83.4 16.5 0 4
11 E-v E 83.4 16.5 1 4
12 E-2 E 83.4 16.5 2 4
13 E-v2 E 83.4 16.5 3 4
14 E-1 E 83.4 16.5 4 4
15 F F 19.1 98.4 0 2
16 F-ab F 19.1 98.4 1 2
17 F-nb F 19.1 98.4 2 2
18 G G 78.0 17.0 0 1
19 G-ku G 78.0 17.0 1 1
This is how it should look in the end.
The problem is adding each new row before the rows that it is derived from.
Let's groupby abbreviation and prepend a row to each group, with the Quantity column set to 0 and the name column set to the value of the abbreviation column:
out = (df.groupby(['abbreviation'], as_index=False)
         .apply(lambda g: pd.concat(
             # header row: a copy of the group's first row with
             # Quantity set to 0 and name set to the group key
             [pd.DataFrame([dict(g.iloc[0].to_dict(),
                                 **{'Quantity': 0, 'name': g.name})]),
              g]))
         .reset_index(drop=True))
print(out)
name abbreviation X Y Quantity Max Quantity
0 A A 15.6 19.4 0 2
1 A-x A 15.6 19.4 1 2
2 A-y2 A 15.6 19.4 2 2
3 B B 15.0 25.0 0 2
4 B-a B 15.0 25.0 1 2
5 B-d B 15.0 25.0 2 2
6 C C 15.0 10.0 0 3
7 C-x1 C 15.0 10.0 1 3
8 C-c4 C 15.0 10.0 2 3
9 C-5 C 15.0 10.0 3 3
10 E E 83.4 16.5 0 4
11 E-v E 83.4 16.5 1 4
12 E-2 E 83.4 16.5 2 4
13 E-v2 E 83.4 16.5 3 4
14 E-1 E 83.4 16.5 4 4
15 F F 19.1 98.4 0 2
16 F-ab F 19.1 98.4 1 2
17 F-nb F 19.1 98.4 2 2
18 G G 78.0 17.0 0 1
19 G-ku G 78.0 17.0 1 1
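A sketch of an alternative without apply, assuming (as in the example) that Quantity numbers each group upward from 1: build all the zero-quantity header rows in one shot from the first row of each group, then concatenate and sort.

# the first row of each group carries the right X, Y and Max Quantity
headers = df.drop_duplicates('abbreviation').copy()
headers['name'] = headers['abbreviation']
headers['Quantity'] = 0
# sorting by (abbreviation, Quantity) puts each header row in front of its group
out = (pd.concat([headers, df])
         .sort_values(['abbreviation', 'Quantity'])
         .reset_index(drop=True))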
I have a DataFrame df1 containing daily time-series of IDs and Scores in different countries C. For the countries, I have an additional DataFrame df2 which contains, for each country, 4 quartiles Q with quartile scores Q_Score.
df1:
Date ID C Score
20220102 A US 12.6
20220103 A US 11.3
20220104 A US 13.2
20220105 A US 14.5
20220102 B US 9.8
20220103 B US 19.8
20220104 B US 12.3
20220105 B US 15.1
20220102 C GB 13.5
20220103 C GB 14.5
20220104 C GB 11.5
20220105 C GB 14.8
df2:
Date C Q Q_Score
20220102 US 1 10
20220103 US 2 13
20220104 US 3 16
20220105 US 4 20
20220102 GB 1 12
20220103 GB 2 13
20220104 GB 3 14
20220105 GB 4 15
I'm trying to look up the quartile scores Q_Score and create df3 with an additional column called Q_Score. A specific score should look up the next bigger quartile score from df2 for the specific country. For example:
20220104 / A / US: Score = 13.2 --> next bigger quartile score for the US is 16 --> Q_Score: 16
df3:
Date ID C Score Q_Score
20220102 A US 12.6 13
20220103 A US 11.3 13
20220104 A US 13.2 16
20220105 A US 14.5 16
20220102 B US 9.8 10
20220103 B US 19.8 20
20220104 B US 12.3 13
20220105 B US 15.1 16
20220102 C GB 13.5 14
20220103 C GB 14.5 15
20220104 C GB 11.5 12
20220105 C GB 14.8 15
Because the Score and Q_Score values don't match exactly, I wasn't able to do it with a simple pd.merge().
You can use pd.merge_asof, but you need some preprocessing:
# the merge keys must share the same dtype
df2['Q_Score'] = df2['Q_Score'].astype('float64')

# merge_asof requires both frames to be sorted on the merge keys
pd.merge_asof(df1.sort_values('Score'),
              df2.drop(['Date', 'Q'], axis=1).sort_values('Q_Score'),
              by=['C'],
              left_on='Score',
              right_on='Q_Score',
              direction='forward'
              ).sort_values(['ID', 'Date'])
Output:
Date ID C Score Q_Score
4 20220102 A US 12.6 13.0
1 20220103 A US 11.3 13.0
5 20220104 A US 13.2 16.0
7 20220105 A US 14.5 16.0
0 20220102 B US 9.8 10.0
11 20220103 B US 19.8 20.0
3 20220104 B US 12.3 13.0
10 20220105 B US 15.1 16.0
6 20220102 C GB 13.5 14.0
8 20220103 C GB 14.5 15.0
2 20220104 C GB 11.5 12.0
9 20220105 C GB 14.8 15.0
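A note on the design: direction='forward' makes merge_asof pick, within each country, the first df2 row whose Q_Score is greater than or equal to the Score, which is exactly the "next bigger quartile score" rule; direction='backward' would pick the next smaller one instead. The astype cast is needed because merge_asof refuses to compare merge keys of different dtypes (here int64 against float64).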
I've got a dataframe that looks like:
0 1 2 3 4 5 6 7 8 9 10 11
12 13 13 13.4 13.4 12.4 12.4 16 0 0 0 0
14 12.2 12.2 13.4 13.4 12.6 12.6 19 5 5 6.7 6.7
.
.
.
Each 'layer'/row contains pairs of duplicate values that I want to reduce to single values.
The one problem is that there are legitimately repeating 0s as well, so I cannot simply drop duplicates per row or it will leave rows of uneven length.
Ideally I'd have a lambda function that I could apply to all rows of this dataframe to get this:
0 1 2 3 4 5 6
12 13 13.4 12.4 16 0 0
14 12.2 13.4 12.6 19 5 6.7
.
.
.
Is there a simple function I could write to do this?
Method 1 using transpose
As mentioned by Yuca in the comments:
df = df.T.drop_duplicates().T
df.columns = range(len(df.columns))
print(df)
0 1 2 3 4 5 6
0 12.0 13.0 13.4 12.4 16.0 0.0 0.0
1 14.0 12.2 13.4 12.6 19.0 5.0 6.7
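One caveat: the double transpose funnels all values through a common dtype (notice the integer columns come back as floats above), and on frames with mixed types it goes through object dtype, so you may need .astype(...) or .infer_objects() afterwards.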
Method 2 using list comprehension with even indices
We can make a list of even column indices and then select those columns by position. Note that this assumes the values come in aligned pairs; the unpaired 16 in this example gets dropped, as the output below shows:
idxcols = [x-1 for x in range(len(df.columns)) if x % 2]
df = df.iloc[:, idxcols]
df.columns = range(len(df.columns))
print(df)
0 1 2 3 4 5
0 12 13.0 13.4 12.4 0 0.0
1 14 12.2 13.4 12.6 5 6.7
In your case
# keep the first occurrence of each value in every row, preserving order
l = [sorted(set(x), key=x.index) for x in df.values.tolist()]
# rows come out ragged where a row had more duplicates; pad by forward-filling across columns
newdf = pd.DataFrame(l).ffill(axis=1)
newdf
Out[177]:
0 1 2 3 4 5 6
0 12.0 13.0 13.4 12.4 16.0 0.0 0.0
1 14.0 12.2 13.4 12.6 19.0 5.0 6.7
You can use functools.reduce to sequentially concatenate columns to your output DataFrame if the next column is not equal to the last column added:
from functools import reduce

output_df = reduce(
    lambda d, c: d if (d.iloc[:, -1] == df[c]).all() else pd.concat([d, df[c]], axis=1),
    df.columns[1:],               # candidate columns, in order
    df[df.columns[0]].to_frame()  # seed with the first column
)
print(output_df)
# 0 1 3 5 7 8 10
#0 12 13.0 13.4 12.4 16 0 0.0
#1 14 12.2 13.4 12.6 19 5 6.7
This method also maintains the column names of the columns which were picked, if that's important. Note that each column is compared only against the most recently kept one, so only consecutive duplicates are collapsed, which matches the structure of this data.
Assuming this is your input df:
print(df)
# 0 1 2 3 4 5 6 7 8 9 10 11
#0 12 13.0 13.0 13.4 13.4 12.4 12.4 16 0 0 0.0 0.0
#1 14 12.2 12.2 13.4 13.4 12.6 12.6 19 5 5 6.7 6.7
I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
time Group blocks
0 1 A 4
1 2 A 7
2 3 A 12
3 4 A 17
4 5 A 21
5 6 A 26
6 7 A 33
7 8 A 39
8 9 A 48
9 10 A 59
.... .... ....
36 35 A 231
37 1 B 1
38 2 B 1.5
39 3 B 3
40 4 B 5
41 5 B 6
.... .... ....
911 35 Z 349
This is a dataframe with multiple time-series-like data, running from time min=1 to max=35. Each Group has a relationship over the range time=1 to time=35.
I would like to segment this dataframe into columns Group A, Group B, Group C, etc.
How does one "unconcatenate" this dataframe?
Is that what you want?
In [84]: df.pivot_table(index='time', columns='Group')
Out[84]:
blocks
Group A B
time
1 4.0 1.0
2 7.0 1.5
3 12.0 3.0
4 17.0 5.0
5 21.0 6.0
6 26.0 NaN
7 33.0 NaN
8 39.0 NaN
9 48.0 NaN
10 59.0 NaN
35 231.0 NaN
data:
In [86]: df
Out[86]:
time Group blocks
0 1 A 4.0
1 2 A 7.0
2 3 A 12.0
3 4 A 17.0
4 5 A 21.0
5 6 A 26.0
6 7 A 33.0
7 8 A 39.0
8 9 A 48.0
9 10 A 59.0
36 35 A 231.0
37 1 B 1.0
38 2 B 1.5
39 3 B 3.0
40 4 B 5.0
41 5 B 6.0
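A possible follow-up: since blocks is the only value column here, passing values='blocks' avoids the extra column level entirely; a small sketch (the renaming steps are optional):

wide = df.pivot_table(index='time', columns='Group', values='blocks')
wide.columns.name = None           # drop the 'Group' label on the columns axis
wide = wide.add_prefix('Group ')   # optional: columns become 'Group A', 'Group B', ...
print(wide)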
I have a pandas dataframe df and I want to produce the final output dataframe final_df shown below:
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 12 a 25 4
3 13 a 29 5
In [18]: final_df
Out[18]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 0.0 5.0
In [19]: dates=[10,11,12,13,14]
That is, as you can see, I want to fill in the missing dates, filling the corresponding values in the cost column with 0, but filling the prev column with the value from the previous date. As a single date may contain multiple symbols, I am using pivot_table.
If I use ffill:
In [12]: df.pivot_table(index="Date",columns="symbol").reindex(dates,method="ffill").stack().reset_index()
Out[12]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 30.0 9.0
3 11 b 33.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 29.0 5.0
This gives almost the final data structure (it has 7 rows, like final_df), except in the cost column, where it copies the previous data but I want 0 there.
So I tried to fill the missing values of the different columns with different methods, but that creates a problem:
In [13]: df1=df.pivot_table(index="Date",columns="symbol").reindex(dates)
In [14]: df1["cost"]=df1["cost"].fillna(0)
In [15]: df1["prev"]=df1["prev"].ffill()
In [16]: df1.stack().reset_index()
Out[16]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 12 b 0.0 10.0
6 13 a 29.0 5.0
7 13 b 0.0 10.0
8 14 a 0.0 5.0
9 14 b 0.0 10.0
As you can see, the output has data with symbol "b" for dates 12, 13 and 14, but I don't want that: the initial dataframe had no data with symbol "b" for dates 12 and 13, and I want to keep it that way; likewise there must be none for the new date 14, as it follows 13.
So how can I solve this problem and get the final_df output?
EDIT
Here is another example to check the solution against.
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 14 a 29 5
In [18]: dates=range(10,17)
In [19]: final_df
Out[19]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 11 a 0 9
3 11 b 0 10
4 12 a 0 9
5 12 b 0 10
6 13 a 0 9
7 13 b 0 10
8 14 a 29 5
9 15 a 0 5
10 16 a 0 5
Solution
I have found a way to solve the problem. The trick is to mark the places that are genuinely missing in the initial pivot_table with a sentinel value and remove those rows at the end.
In [43]: import numpy as np

In [44]: df1 = df.pivot_table(index="Date", columns="symbol", fill_value="missing").reindex(dates)

In [45]: df1["cost"] = df1["cost"].fillna(0)

In [46]: df1["prev"] = df1["prev"].ffill()

In [47]: df1.stack().replace(to_replace="missing", value=np.nan).dropna().reset_index()
Out[47]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 0.0 9.0
5 12 b 0.0 10.0
6 13 a 0.0 9.0
7 13 b 0.0 10.0
8 14 a 29.0 5.0
9 15 a 0.0 5.0
10 16 a 0.0 5.0
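Why the sentinel works: "missing" is a real (non-NaN) value, so fillna and ffill leave it in place, and ffill even propagates it forward into the reindexed dates; once a symbol is absent on an observed date, every later filled date inherits "missing" and gets dropped as well, which is what keeps symbol "b" from reappearing after date 14 in the second example. The same steps wrapped as a small helper (a sketch; the function name is mine and the column names are taken from the example):

import numpy as np
import pandas as pd

def fill_missing_dates(df, dates):
    # mark (Date, symbol) cells that are absent on an observed date with a sentinel
    pv = df.pivot_table(index="Date", columns="symbol", fill_value="missing").reindex(dates)
    pv["cost"] = pv["cost"].fillna(0)  # entirely new dates: cost defaults to 0
    pv["prev"] = pv["prev"].ffill()    # entirely new dates: carry prev forward
    # sentinel cells (and everything ffilled from them) are dropped at the end
    return (pv.stack()
              .replace(to_replace="missing", value=np.nan)
              .dropna()
              .reset_index())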