I have a dataframe created by the following code:
dfHubR2I=dfHubPV2.loc[dfHubPV2['Ind'].dt.year == year, :].groupby(['SHOP_CODE', dfHubPV2['Ind'].dt.month])['R2I'].agg(['median']).fillna('-')
dfHubR2I=dfHubR2I['median'].unstack('SHOP_CODE')
dfHubR2I=dfHubR2I.iloc[:date.month-1]
dfHubR2I
It looks like this:
shop code   A   B   C   D  All Shops
ind
1          23  34  23  56         34
2          13  23  45  47         34
3          56  67  42  85         57
4           3   3   2   6         46
where ind is months and the letters are different shops
I then got the median across all the shops for each month from this code:
dfHubR2Imonthallshops=dfHubPV2.loc[dfHubPV2['Ind'].dt.year == year, :].groupby([dfHubPV2['Ind'].dt.month])['R2I'].agg(['median']).fillna('-')
dfHubR2Imonthallshops=dfHubR2Imonthallshops.rename(columns={'median':'All Shops'})
dfHubR2Imonthallshops=dfHubR2Imonthallshops.iloc[:date.month-1]
dfHubR2Imonthallshops
which looks like this:
        A  B  C  D  All shops
median  2  3  4  5          2
And I need to append it onto the bigger dataframe as a row, but when I try to use pd.concat I get the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I'm assuming it's because the larger dataframe has 2 levels, but I'm not sure how to go about getting my final desired result:
shop code   A   B   C   D  All shops
ind
1          23  34  23  56         34
2          13  23  45  47         34
3          56  67  42  85         57
4           3   3   2   6         46
YTD         2   3   4   5          2
Have you tried doing it with an assignment?
dfHubR2I.loc['YTD', :] = dfHubR2Imonthallshops.loc['median', :]
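If you'd rather keep pd.concat, one thing worth trying (a rough sketch, assuming dfHubR2Imonthallshops really has the same columns as dfHubR2I, as in your printed output) is to relabel its single 'median' row before concatenating, so the appended row already carries the label you want:
# Hypothetical sketch: relabel the row, then append it.
ytd_row = dfHubR2Imonthallshops.rename(index={'median': 'YTD'})
dfHubR2I = pd.concat([dfHubR2I, ytd_row])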
Eleonora
I asked something similar yesterday but I had to rephrase the question and change the dataframes that I'm using. So here is my question again:
I have a dataframe called df_location. In this dataframe I have duplicated ids because each id has a timestamp.
location = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
'temperature_value':[20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
'humidity_value':[60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
What I am trying to achieve is to map the values of list_of_locations to location_id. If the values match, then the island_id for that location should go into a new column in df_location.
(Note: I don't want to remove any duplicated ids; I need to keep them as they are.)
Resulting dataframe:
final_dataframe = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
'humidity_value':[60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76],
'island_id':[10,10,10,10,20,20,20,20,20,20,30,30,40,40,40,50,60]}
df_final_dataframe = pd.DataFrame(final_dataframe)
This is just a sample of the dataframe that I have. What I actually have is a dataframe of about 13,000,000 rows and 4 columns. How can this be achieved in an efficient way? Is there a pythonic way to do it? I tried using for loops, but it takes too long and still didn't work. I would really appreciate it if someone could give me a solution to this problem.
Here's a solution:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup, left_on="location_id", right_on="location").drop("location", axis=1)
The output is:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 1 21 61 10
2 1 22 62 10
3 1 23 63 10
4 2 24 64 20
5 2 25 65 20
6 2 27 66 20
7 3 28 67 20
8 3 29 68 20
9 3 30 69 20
10 4 31 70 30
11 5 32 71 30
12 6 33 72 40
13 7 34 73 40
14 8 35 74 40
15 9 36 75 50
16 10 37 76 60
If some of the locations don't have a matching island_id, but you'd still like to see them in the results (with island_id NaN), use how="left" in the merge statement, as in:
island_lookup = df_islands.explode("list_of_locations").rename(columns={"list_of_locations": "location"})
pd.merge(df_location, island_lookup,
         left_on="location_id",
         right_on="location",
         how="left").drop("location", axis=1)
The result would be (note location_id 12 on row 3):
location_id temperature_value humidity_value island_id
0 1 20 60 10.0
1 1 21 61 10.0
2 1 22 62 10.0
3 12 23 63 NaN
4 2 24 64 20.0
5 2 25 65 20.0
6 2 27 66 20.0
...
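If memory becomes an issue with the merge on ~13,000,000 rows, an alternative sketch (assuming each location_id belongs to exactly one island, as in your sample) is to build a plain dict once and map it onto the column:
# Hypothetical alternative: build a {location_id: island_id} lookup and map it.
island_lookup = df_islands.explode("list_of_locations")
location_to_island = dict(zip(island_lookup["list_of_locations"], island_lookup["island_id"]))
df_location["island_id"] = df_location["location_id"].map(location_to_island)
Unmatched locations simply come out as NaN, the same as with the how="left" merge.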
I have imported this file as a DataFrame in pandas. The left-most column is time (7 am to 9:15 am); rows show traffic volume at the intersection in 15-minute increments. How do I find the peak hour, i.e. the hour with the most volume? To get the hourly volumes, I have to add 4 rows.
I am a newbie with Python and any help is appreciated.
import pandas as pd
f_path ="C:/Users/reggi/Dropbox/1. 2020/6. Programming Python/Text Files/TMC118txt.txt"
df = pd.read_csv(f_path, index_col=0, sep='\s+')
Sample of the data file below: the first column is time in 15-minute increments, and the first row is the traffic count by movement.
NBL NBT NBR SBL SBT SBR EBL EBT EBR WBL WBT WBR
715 8 3 12 1 1 0 4 93 18 36 68 4
730 16 5 20 5 2 1 0 135 12 39 128 3
745 9 1 29 6 2 3 4 169 21 28 163 6
800 10 2 33 4 0 4 4 147 8 34 174 6
815 11 1 30 1 4 3 4 93 10 28 140 8
My approach would be to move the time to a column:
df.reset_index(inplace=True)
Then I would create a new column for the hour and one for the minutes (converting the time values to strings so they can be sliced):
df['hour'] = df['index'].astype(str).str[:-2]
df['minute'] = df['index'].astype(str).str[-2:]
Then you could do a groupby on hour and sum the traffic movements, sort, etc.
hourly = df.groupby(by='hour').sum()
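Note that grouping by clock hour gives 7:00-7:59, 8:00-8:59, and so on. If by peak hour you mean the busiest 60-minute window starting at any 15-minute mark, a rolling sum over 4 rows is another option. A rough sketch, assuming df is the frame as originally read (time still in the index):
# Hypothetical sketch: total volume per 15-minute row, then a rolling
# sum over 4 consecutive rows (one hour ending at that row's time).
row_totals = df.sum(axis=1)
hourly_totals = row_totals.rolling(window=4).sum()
peak_hour_end = hourly_totals.idxmax()   # time at which the busiest hour ends
peak_volume = hourly_totals.max()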
I would like to sum the frequencies over multiple columns with pandas. The number of columns can vary between 2 and 15. Here is an example with just 3 columns:
code1 code2 code3
27 5 56
534 27 78
27 312 55
89 312 27
And I would like to have the following result:
code frequency
5 1
27 4
55 1
56 1
78 1
89 1
312 2
534 1
Counting values inside one column is not the problem; I just need the total frequency of each value across the whole dataframe, no matter how many columns there are.
You could stack and take the value_counts on the resulting series:
df.stack().value_counts().sort_index()
5 1
27 4
55 1
56 1
78 1
89 1
312 2
534 1
dtype: int64
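If you prefer the output shaped like your code/frequency table rather than a Series, one possible follow-up (using rename_axis and reset_index) is:
# Optional: turn the Series into a two-column frame matching the desired layout.
freq = df.stack().value_counts().sort_index()
freq = freq.rename_axis('code').reset_index(name='frequency')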
I'm working on an ML project for a class. I'm currently cleaning the data and I encountered a problem. I basically have a column (identified as dtype object) that holds ratings of a certain aspect of a hotel. When I checked what the values of this column were and how often they appeared, I noticed that there are some wrong values in it (as you can see below, instead of ratings, some rows have a date as a value):
rating.value_counts()
100 527
98 229
97 172
99 163
96 150
95 127
93 100
90 94
94 93
80 65
92 55
91 39
88 35
89 32
87 31
85 25
86 17
84 12
60 12
83 8
70 5
73 5
82 4
78 3
67 3
2018-11-11 3
20 2
81 2
2018-11-03 2
40 2
79 2
75 2
2018-10-26 2
2 1
2018-08-30 1
2018-09-03 1
2015-09-05 1
55 1
2018-10-12 1
2018-05-11 1
2018-11-14 1
2018-09-15 1
2018-04-07 1
2018-08-16 1
71 1
2018-09-18 1
2018-11-05 1
2018-02-04 1
NaN 1
What I wanted to do was to replace all the values that look like dates with NaN so I can later fill them with appropriate values. Is there a good way to do this other than selecting each different date one by one and replacing it with a NaN? Is there a way to select similar values (in this case all the dates that start in the same way, 2018) and replace them all?
Thank you for taking the time to read this!!
There are multiple options to clean this data.
Option 1: The rating column is of object type; search the strings for the presence of '-' and replace those values with np.nan.
df.loc[df['rating'].str.contains('-', na = False), 'rating'] = np.nan
Option 2: Convert the column to numeric, which will coerce the dates to NaN.
df['rating'] = pd.to_numeric(df['rating'], errors = 'coerce')
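If searching for '-' feels too broad (it would also catch things like negative numbers stored as strings), a stricter variant is to match a full date pattern instead; a sketch assuming the same df and that numpy is imported as np:
# Hypothetical variant: flag only values that look like YYYY-MM-DD dates.
is_date = df['rating'].astype(str).str.match(r'^\d{4}-\d{2}-\d{2}$', na=False)
df.loc[is_date, 'rating'] = np.nan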