I am using pandas to analyse data from an Excel sheet.
The sheet has 8 columns, including Timestamp, City, Domain, State and so on.
I only want to analyse the City and Timestamp columns.
I selected the City and Timestamp columns from the sheet into a DataFrame. I then found the city count, i.e. how many rows contain the same city, using cities_df['Count'] = df['City_Town_Village '].value_counts()
After finding the city count, I calculated the percentage for each city using cities_df['PctCnt'] =(cities_df['Count']/sum(cities_df['Count'])*100).apply("{0:.2f}".format)
Now my question: when I compute the city count, the number of rows in my dataframe decreases. My df has 238 rows, but after the count there are only 128. That is not a problem in itself; I understand the rows decrease because of the count.
My df also has the Timestamp column. Say for the city Delhi some people registered on 28-May-2021 and some on 29-May-2021. After computing the city count, my df only shows the timestamp for the beginning date, i.e. 28-May.
I don't know why this is happening. What I actually want is to segregate the data into two weeks and plot a graph week-wise, and also plot the city percentages.
Here is my Excel file
This is the code I'm using:
import pandas as pd
df = pd.read_excel('PCS_NWR_Sheet.xlsx')
df.head()
pd.set_option('display.max_rows', 300)
cities_df = pd.DataFrame()
cities_df['Count'] = df['City_Town_Village '].value_counts()
cities_df.index.names=['City']
cities_df.reset_index(inplace = True)
cities_df['Timestamp'] = df['Timestamp']
column_names = ['Timestamp', 'City', 'Count']
cities_df = cities_df.reindex(columns=column_names)
cities_df['PctCnt'] =(cities_df['Count']/sum(cities_df['Count'])*100).apply("{0:.2f}".format)
Metro_list = ['Hyderabad', 'Kolkata', 'Delhi', 'Pune', 'Bengaluru', 'Noida', 'Kanpur', 'Gurgaon']
top_metro=cities_df[cities_df['City'].isin(Metro_list)]
top_metro
.value_counts() returns a Series whose length equals the number of unique values in what you are counting. So you get fewer rows because it groups identical values together.
I can think of two ways to solve the question (if I understand it right).
Do .value_counts() on both the date column and the city column.
df[['City_Town_Village','Date']].value_counts()
If you don't currently have your timestamps as dates, you'll need to make a date column first (you probably won't be able to group on datetimes, since the times will vary). This gives a Series with one row for every existing combination of the two columns.
Make a separate dataframe with the value_count of the town column, and merge them. That is, if you want a column in your main dataframe that has the number of times that the town comes up ever in the data, that column is a different size (as we said), so you'll store it somewhere else but can bring it back in as needed.
df2 = pd.DataFrame(df['City_Town_Village'].value_counts())
df2.reset_index(inplace=True) # by default, building the df from value_counts() makes your city/town the index; this restores a normal index
df2.columns = ['City_Town_Village','Count'] #rename the columns
df = df.merge(df2,how='left',on='City_Town_Village')
This adds a Count column to df, holding the number of times each row's City/Town appears in the original dataset.
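Both options can be sketched on made-up data. The column names `City` and `Timestamp`, the sample rows and the weekly grouping are assumptions for illustration, not the asker's actual sheet. The second option here uses `transform` instead of a merge, which keeps every row and so leaves the timestamps intact:

```python
import pandas as pd

# Hypothetical sample data standing in for the Excel sheet.
df = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Pune", "Delhi", "Pune"],
    "Timestamp": pd.to_datetime([
        "2021-05-28 09:15", "2021-05-28 17:40", "2021-05-28 11:00",
        "2021-05-29 08:05", "2021-06-07 13:30",
    ]),
})

# Option 1: drop the time of day, then count each (city, date) pair.
df["Date"] = df["Timestamp"].dt.date
per_city_date = df[["City", "Date"]].value_counts()

# Option 2: keep every row and attach the overall count for its city.
df["Count"] = df.groupby("City")["City"].transform("size")
df["PctCnt"] = (df["Count"] / len(df) * 100).round(2)

# Registrations per city per week ("W" bins end on Sunday).
weekly = df.groupby(["City", pd.Grouper(key="Timestamp", freq="W")]).size()
print(weekly)
```

Because no rows are dropped, the weekly result could then be plotted, for example with `weekly.unstack(0).plot(kind="bar")`.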
I'm new to the world of python so I apologize in advance if this question seems pretty rudimentary. I'm trying to pull columns of one dataframe into a separate dataframe. I want to replace the duplicate columns from the first dataframe with one column that contains the mean values into the second dataframe. I hope this makes sense!
To provide some background, I am tracking gene expression over certain time points. I have a dataframe that is 17 rows x 33 columns. Every row in this data frame corresponds to a particular exon. Every column on this data frame corresponds to a time-point (AGE).
The dataframe looks like this:
Some of these columns contain the same name (age) and I'd like to calculate the mean of ONLY the columns with the same name, so that, for example, I get one column for "12 pcw" rather than three separate columns for "12 pcw." After which I hope to pull these values from the first dataframe into a second dataframe for averaged values.
I'm hoping to use a for loop to loop through each age (column) to get the average expression across the subjects.
I will explain my process so far below:
#1) Get list of UNIQUE string names from age list
unique_ages = set(column_names)
#2) Create an empty dataframe that gives an outline of what I want my averaged data to fit/be put in
mean_df = pd.DataFrame(index=exons, columns=unique_ages)
#3) Now I want to loop through each age to get the average expression across the donors present. This is where I'm trying to utilize a for loop to create a pipeline to process other data frames that I will be working with in the future.
for age in unique_ages:
    print(age)
    age_df = pd.DataFrame() ##pull columns of df as separate df that have this string
    if len(age_df.columns) > 1: ##check if df has >1 SAME column, if so, take avg across SAME columns
        mean = df.mean(axis=1)
        mean_df[age] = mean
    else:
        ## just pull out the values and put them into your temp_df
#4) Now, with my new averaged array (or same array if multiple ages NOT present), I want to place this array into my 'temp_df' under the appropriate columns. I understand that I should use the 'age' variable provided by the for loop to get the proper location/name of the column in my temp df. However I'm not sure how to do this. This has all been quite a steep learning curve and I feel like it's a simple solution but I can't seem to wrap my head around it. Any help would be greatly appreciated.
There is no need for a for loop (there often isn't with Pandas :)). You can simply use df.groupby(lambda x:x, axis=1).mean(). An example:
data = [[1,2,3],[4,5,6]]
cols = ['col1', 'col2', 'col2']
df = pd.DataFrame(data=data, columns=cols)
#    col1  col2  col2
# 0     1     2     3
# 1     4     5     6
df = df.groupby(lambda x:x, axis=1).mean()
#    col1  col2
# 0   1.0   2.5
# 1   4.0   5.5
The groupby function takes another function (the lambda), which is called with each column name and returns the group that column belongs to. In our case, we just want the column name itself to be the group. So, on the third column named col2, it will say 'this column belongs to the group named col2', which already exists (because the second column was passed earlier). You then provide the aggregation you want, in this case mean().
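On recent pandas versions, passing axis=1 to groupby is deprecated (since 2.1). A minimal sketch of an equivalent that avoids it is to transpose, group on the index, and transpose back. The variable name `averaged` is only illustrative:

```python
import pandas as pd

data = [[1, 2, 3], [4, 5, 6]]
cols = ["col1", "col2", "col2"]
df = pd.DataFrame(data=data, columns=cols)

# Transpose so the duplicated column names become index labels,
# average the rows that share a label, then transpose back.
averaged = df.T.groupby(level=0).mean().T
print(averaged)
```

This should give the same col1/col2 means as the lambda version above, at the cost of two transposes.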
I want to group the data from the table according to the 'name' column values and keep only the rows whose 'name' value occurs more than once in the table. The following code works fine for the given small data table.
import pandas as pd
data={'Name':['Danny','Damny','Monny','Quony','Dimny','Danny'],
'Email':['danny#gmail.com','danny#gmail.com','monny#gmail.com','quony#gmail.com','danny#gmail.com','danny#gmail.com'],
'IBAN':['NLAMRO123456789','NLINGB126656723','BGFFEO128856754','NLAMRO123896763','DUDMRO567456722','NLRABO123456712']} #data with three columns
df=pd.DataFrame(data) #creation of dataframe
df['No Dutch Bank']=None #creation of extra column for analysis
df.loc[df['IBAN'].str.find('NL') == -1, 'No Dutch Bank']='ja'# to find rows, which contain non dutch bank numbers.
df_filt=df[['Name',"Email", "IBAN"]]#filtering columns needed in the final results
df_gb = df_filt[df_filt.duplicated(subset=['Name'], keep=False)].sort_values(by='Name', ascending=False).reset_index(drop=True)#filtering on a required column
piv_tab = pd.pivot_table(df_gb, index=['Name',"Email", "IBAN"])#applying pivot table
piv_tab
This works fine for original data of three columns and six rows. In practice I have data of thirty columns and thirty thousand (30000) rows. When I select three columns (Name, Email and IBAN) and run the same code, the code does not filter out occurrence of rows which appeared in the table only once.
Why?
Everything except for the pivot logic is correct and should work for a larger dataset as well.
Another approach would be to calculate the count of each name, filter by count > 1, and merge the names back to this table via a left join.
This way the count > 1 filter is more explicit, and it is easy to change it to count > 2 etc.
import pandas as pd
# count names
df_name = df_filt.groupby('Name')['Email'].count()
# get names with count > 1
df_name = df_name[df_name > 1].reset_index()
# merge filtered names back to original df to get filtered df
df_gb = pd.merge(df_name['Name'], df_filt, on=['Name'], how='left')
# sorting etc.
df_gb = df_gb.sort_values(by='Name', ascending=False).reset_index(drop=True)
# and some pivot stuff
...
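As a self-contained sketch of the same count-then-filter idea on the question's sample names, here is a variant that uses `transform` instead of a merge. The `min_count` threshold is a hypothetical name introduced for illustration:

```python
import pandas as pd

df_filt = pd.DataFrame({
    "Name": ["Danny", "Damny", "Monny", "Quony", "Dimny", "Danny"],
    "IBAN": ["NLAMRO123456789", "NLINGB126656723", "BGFFEO128856754",
             "NLAMRO123896763", "DUDMRO567456722", "NLRABO123456712"],
})

min_count = 2  # hypothetical threshold: keep names seen at least this often

# Per-row count of how often that row's name occurs in the whole table.
name_counts = df_filt.groupby("Name")["Name"].transform("size")

df_gb = (
    df_filt[name_counts >= min_count]
    .sort_values(by="Name", ascending=False)
    .reset_index(drop=True)
)
print(df_gb)
```

Raising `min_count` to 3 would keep only names that occur three or more times.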
I have a DataFrame that is the result of a large SQL query. I am trying to split it into two separate DataFrames, NVI and Main. Both are lists of repairs to trucks. The split depends on whether a row has a specific profile id, 7055; those rows go into the NVI DataFrame.
When such a row is encountered, I need to grab the values from its "RO", "Unit Number" and "Repair Date" columns. I then need to search the DataFrame again with those values and grab any rows that have a matching RO and Unit Number, or a matching Unit Number and a Repair Date equal to or earlier than the date in the row where the 7055 was found. Those rows also need to go into the NVI df. Any remaining rows that do not match go into the Main df.
The only static value is the profile id of 7055. The RO, Unit Number and Repair Date will all be different.
class nvi_dict(dict):
    def __setitem__(self, key, value):
        key = key.profile()
        super().__setitem__(key, value)
nvisort = pd.DataFrame()
def sort_nvi_dict(row, component):
    if row ['PROFILE_ID'] in cfg[component]['nvi']:
        nvi_ro = nvi_dict()
        nvi_ro ['RO'] = row ['RO']
        nvi_ro ['UnitNum'] = row ['VFUNIT']
        nvi_ro ['date']= row['REPAIR_DATE']
nvisort = nvidf.apply(lambda x: sort_nvi_dict(x, 'nvi_ro'), axis=1, result_type='expand')
I thought about using a class to create a temporary dict object that stores the values from RO, UnitNum and Date, which I could then use to iterate over the df again looking for matching values.
I am using a .yml file to store dictionaries, which I use to further sort each of the NVI and Main dfs after they have been separated, because each of them then needs to be sorted by truck manufacturer.
I think this might work, unable to test without the test data though...
# rows that carry the trigger profile id
df1 = nvisort[nvisort['profile_id'] == 7055]
# rows whose RO and Unit Number both match a 7055 row
df2 = pd.merge(nvisort,df1[['RO','Unit Number']],on=['RO','Unit Number'],how='right')
# rows with a matching Unit Number, keeping both dates (_x from nvisort, _y from df1)
df3 = pd.merge(nvisort,df1[['Unit Number','Repair Date']],on=['Unit Number'],how='right')
df3 = df3[df3['Repair Date_x'] <= df3['Repair Date_y']]
df3 = df3.drop(columns=['Repair Date_y'])
df3 = df3.rename(columns={'Repair Date_x':'Repair Date'})
NVI = pd.concat([df1,df2,df3])
Main = pd.concat([NVI,nvisort]).drop_duplicates(keep=False)
I'm assuming that your original/starting dataframe here is nvisort. We first filter it to the rows with a profile_id of 7055 and call that df1.
Then we get your two different criteria into df2 and df3.
df2 is just a filter on the original dataframe where RO and Unit Number match, so we can use pd.merge() to effectively get that filter.
df3 is a more complicated filter, since it is a less-than-or-equal comparison rather than an equality. So first we merge to filter on matching unit numbers, but we also bring the Repair Date from both tables into df3, and these get _x and _y appended to their column names. Then we keep the rows where the _x date is less than or equal to the _y date and clean up the columns.
Last, you get Main by finding everything from the original nvisort that is not in NVI. Since NVI is a subset of nvisort, you can just concat them and drop all duplicates, leaving only the rows that exist in one of the dataframes.
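Since the code above could not be tested, here is a small runnable sketch of the same split on made-up data. The sample rows, the `repairs` and `triggers` names and the use of a boolean mask for the final split are assumptions for illustration, not the asker's schema:

```python
import pandas as pd

# Hypothetical repairs table; column names follow the answer above.
repairs = pd.DataFrame({
    "profile_id": [7055, 1001, 1002, 1003, 1004],
    "RO": ["A1", "A1", "B7", "C3", "D9"],
    "Unit Number": [10, 10, 10, 10, 20],
    "Repair Date": pd.to_datetime(
        ["2021-03-10", "2021-03-10", "2021-02-01", "2021-04-01", "2021-01-05"]
    ),
})

# Rows carrying the trigger profile id.
triggers = repairs.loc[
    repairs["profile_id"] == 7055, ["RO", "Unit Number", "Repair Date"]
]

# Pair every row with each trigger row for the same unit,
# remembering the original row label in the "index" column.
pairs = repairs.reset_index().merge(
    triggers, on="Unit Number", suffixes=("", "_trigger")
)

same_ro = pairs["RO"] == pairs["RO_trigger"]
not_later = pairs["Repair Date"] <= pairs["Repair Date_trigger"]
nvi_labels = pairs.loc[same_ro | not_later, "index"].unique()

# Split on a boolean mask so each row lands in exactly one frame.
is_nvi = repairs.index.isin(nvi_labels)
NVI = repairs[is_nvi]
Main = repairs[~is_nvi]
print(NVI)
print(Main)
```

Splitting on a mask avoids relying on drop_duplicates, which would also discard any rows that are genuinely duplicated in the source data.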
From what I understand of your question, you want to split a dataframe in two based on certain conditions?
df1 = df[<condition>]
where <condition> can be, for example, (df['profile id'] == 7055) & df['unit'].isin(Allunits), with Allunits being the collection of unit numbers you want to match.
I have a dataframe of Covid-19 deaths by country. Countries are identified in the Country column. Sub-national classification is based on the Province column.
I want to generate a dataframe which sums all columns based on the value in the Country column (except the first 2, which are geographical data). In short, for each date, I want to compress the observations for all provinces of a country such that I get a single number for each country.
Right now, I am able to do that for a single date:
import pandas as pd
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
raw = pd.read_csv(url)
del raw['Lat']
del raw['Long']
raw.rename({'Country/Region': 'Country', 'Province/State': 'Province'}, axis=1, inplace=True)
raw2 = raw.groupby('Country')['6/29/20'].sum()
How can I achieve this for all dates?
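You can use iloc to select the date columns. They start at position 2 once Lat and Long have been deleted as in your code (use 4: instead if those two columns are still present):
raw2 = raw.iloc[:,2:].groupby(raw.Country).sum()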
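A sketch of the same aggregation that selects columns by name rather than by position, on a tiny made-up frame. The sample values are invented; only the column layout mirrors the question after its renaming step:

```python
import pandas as pd

# Hypothetical miniature of the deaths table after Lat/Long were removed.
raw = pd.DataFrame({
    "Province": ["Hubei", "Beijing", None],
    "Country": ["China", "China", "Italy"],
    "6/28/20": [10, 2, 7],
    "6/29/20": [11, 2, 9],
})

# Drop the non-numeric Province column, then sum every date column per country.
by_country = raw.drop(columns="Province").groupby("Country").sum()
print(by_country)
```

Dropping by name keeps working if the number of leading non-date columns changes.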
I have two data frames loaded into Pandas. Each data frame holds property information indexed by a 'pin' unique to a particular parcel of land.
The first data frame (df1) represents historic sales data. Because properties can be sold multiple times, index values (the 'pin') repeat (i.e. for each time a property was sold there will be a row with the parcel's 'pin' as the index number. If the property is sold 1 time in the data set, the index/'pin' is unique. If it was sold 5 times, the index/'pin' will occur 5 times in the data set).
The second data frame (df2) is a property record. Again they are indexed by the unique parcel pin, but because this data frame is a record of each property, the value_counts() for each index value is 1 (i.e. index values do not repeat).
I would like to add data to df1 or create a new data frame which keeps all data from df1 intact, but adds values from df2 based upon matching index values.
For Example: df1 has columns ['SALE_YEAR', 'SALE_VALUE'] - where there can be multiple rows with the same index value. df2 has columns ['Address', 'SQFT'], where the index values are all unique within the data frame. I want to add 'Address' & 'SQFT' data points to df1 by matching the index values.
Merge() & Concat() do not seem to work. I believe this is because the syntax is having a hard time processing/ matching df2 values to multiple df1 rows.
Visual Example:
Thank you for the help.
Try this:
import pandas as pd
merged_df = pd.merge(left=df1, right=df2, on='PIN', how='left')
If that still isn't working, the data types of the two PIN columns may not match. Casting both to the same type before merging fixes that:
df1['PIN'] = df1['PIN'].astype(int)
df2['PIN'] = df2['PIN'].astype(int)
merged_df = pd.merge(left=df1, right=df2, on='PIN', how='left')
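If the pin really is the index of both frames, as the question describes, a join on the index may be the simplest route. A minimal sketch on made-up data; the pins, addresses and values are invented for illustration:

```python
import pandas as pd

# Hypothetical sales history: pin 111 was sold twice, pin 222 once.
df1 = pd.DataFrame(
    {"SALE_YEAR": [2001, 2010, 2005], "SALE_VALUE": [100000, 150000, 90000]},
    index=pd.Index([111, 111, 222], name="pin"),
)

# Hypothetical property record: one row per pin.
df2 = pd.DataFrame(
    {"Address": ["1 Main St", "2 Oak Ave"], "SQFT": [1200, 950]},
    index=pd.Index([111, 222], name="pin"),
)

# Left join on the index: every sale row is kept and the matching
# property details are repeated for each sale of the same parcel.
merged = df1.join(df2, how="left")
print(merged)
```

The same result should come from `pd.merge(df1, df2, left_index=True, right_index=True, how="left")`.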