I want to compute the similarity of column Aname from dataframe Apple to column Bname from dataframe Banana, and create a new column in dataframe Apple that shows the similarity. My code is as follows:
from fuzzywuzzy import process  # assuming fuzzywuzzy, which provides process.extract
import pandas as pd

Bname = []
similarity = []
for i in Apple.Aname:
    # best match for this name among all of Banana.Bname, with its score
    ratio = process.extract(i, Banana.Bname, limit=1)
    Bname.append(ratio[0][0])
    similarity.append(ratio[0][1])
Apple['Bname'] = pd.Series(Bname)
Apple['similarity'] = pd.Series(similarity)
However, there are over 400,000 rows in dataframe Apple and over 700,000 rows in dataframe Banana; my code has run for hours and I still haven't gotten a result. How can I do this more efficiently? Or at least, how could I monitor the progress of my code? Thanks a lot for your great help in advance!
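One way to speed this up, sketched below under the assumption that you can install rapidfuzz (a much faster drop-in replacement for fuzzywuzzy) and tqdm (for a progress bar); extractOne skips ranking the whole candidate list when only the best match is needed:
import pandas as pd
from rapidfuzz import process, fuzz
from tqdm import tqdm
choices = Banana.Bname.tolist()  # a plain list avoids per-call Series overhead
matches = [
    process.extractOne(name, choices, scorer=fuzz.WRatio)  # returns (match, score, index)
    for name in tqdm(Apple.Aname)  # tqdm prints a live progress bar
]
Apple['Bname'] = [m[0] for m in matches]
Apple['similarity'] = [m[1] for m in matches]
Even then, 400,000 x 700,000 comparisons is a lot of work; narrowing the candidate list per row (for example by first letter or string length) usually helps more than any library swap.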
My personal side project right now is to analyze GDP growth rates per capita. More specifically, I want to find the average growth rate for each decade since 1960, and then analyze it.
I pulled data from the World Bank API ("wbgapi") as a DataFrame:
import pandas as pd
import wbgapi as wb
gdp = wb.data.DataFrame('NY.GDP.PCAP.KD.ZG')
gdp.head()
Output: (screenshot of gdp.head() omitted)
I then used nested for loops to calculate the mean for every decade and added it to a new dataframe.
row, col = gdp.shape
meandata = pd.DataFrame(columns=['Country', 'Decade', 'MeanGDP', 'Region'])
for r in range(0, row, 1):
    countrydata = gdp.iloc[r]
    for c in range(0, col - 9, 10):
        decade = 1960 + c
        tenyeargdp = countrydata.array[c:c + 10].mean()
        meandata = meandata.append({'Country': gdp.iloc[r].name,
                                    'Decade': decade,
                                    'MeanGDP': tenyeargdp}, ignore_index=True)
meandata.head(10)
The code works and generates the expected output (screenshot of meandata omitted).
However, I have a few questions about this step:
Is there a more efficient way to access the subseries of a dataframe? I read that for loops should never be used on dataframes and that one should vectorize operations instead.
Is the complexity O(n^2) since there are 2 for loops?
The second step is to group the individual countries by region, for future analysis. To do so I rely on the World Bank API, which defines its own regions, each with a list of member economies/countries.
I iterated through the regions and each region's member list. If a country is part of a region's member list, I added that region to its rows.
Since an economy/country can be part of multiple regions (e.g. the 'USA' can be part of both NA and HIC (high-income country)), I concatenated each region code to the previously added ones.
for rg in wb.region.list():
    for co in wb.region.members(rg['code']):
        str1 = '-' + meandata.loc[meandata['Country'] == co, ['Region']].astype(str)
        meandata.loc[meandata['Country'] == co, ['Region']] = rg['code'] + str1
The code mostly works; however, it sometimes gives an error saying that 'meandata' is not defined. I use JupyterLab.
Additionally, is there a simpler/more efficient way of doing the second step?
Thanks for reading and helping. Also, this is my first Python/pandas coding experience, so general feedback is appreciated.
Consider using groupby:
The aggregation will be based on the list of columns you pass to groupby.
In the sample below I take the mean per 'Country' and 'Region'.
meandata = meandata.groupby(['Country', 'Region']).agg({'MeanGDP': 'mean'}).reset_index()
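For the decade means themselves, the nested loops can be replaced by one groupby over the columns. This is only a sketch, assuming the year columns follow wbgapi's usual 'YR1960', 'YR1961', ... naming:
# Extract the year from each column label, bucket it into its decade, and
# average each bucket's columns in one pass, with no Python-level loops.
# (axis=1 groupby is deprecated on newer pandas; transpose first if needed.)
years = gdp.columns.str.extract(r'(\d{4})', expand=False).astype(int)
decade_means = gdp.groupby(years // 10 * 10, axis=1).mean()
meandata = (
    decade_means.stack()              # long format: one row per (country, decade)
    .rename('MeanGDP')
    .rename_axis(['Country', 'Decade'])
    .reset_index()
)
The region step can likewise avoid one .loc write per (region, country) pair: build a country-to-regions mapping once, then map it onto the frame in a single assignment (again a sketch, reusing the same wbgapi calls as your loop):
region_map = {}
for rg in wb.region.list():
    for co in wb.region.members(rg['code']):
        region_map.setdefault(co, []).append(rg['code'])
meandata['Region'] = meandata['Country'].map(lambda c: '-'.join(region_map.get(c, [])))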
I first found a cycle in a manufacturing process. I collected the 2 largest pressure values from the given cycles and printed them to a new sheet. I now need to capture the corresponding time at which each of the largest values occurs. This portion of my code looks like this:
df2 = df.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group', 'row_index'])
df2 = df.groupby('group')['Date/Time']
A sample snippet of the data I am trying to extract was attached as a screenshot (omitted here). Any help on this would be appreciated!
You can sort the data frame and take the last 2 rows per group. Typing this blind, as you did not provide sample data:
df2 = (
df.sort_values(['group', 'Pressure'])
.groupby('group', sort=False)
.tail(2)
)
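If you would rather keep your nlargest approach, its index can be used to pull the matching rows (including 'Date/Time') out of the original frame. A sketch, assuming df has its default unique index:
top2 = df.groupby('group')['Pressure'].nlargest(2)  # indexed by (group, original row index)
rows = top2.index.get_level_values(-1)              # the original row labels
df2 = df.loc[rows, ['group', 'Date/Time', 'Pressure']]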
I have a CSV file which has several columns and several rows; please see the picture linked below. The picture shows just the first two baskets, but in the original CSV file I have hundreds of them.
[1]: https://i.stack.imgur.com/R2ZTo.png
I would like to calculate the average for every fruit in every basket using Python. Here is my code, but it doesn't seem to work as it should. Better ideas? I have also tried to fix this by importing and using numpy, but I didn't succeed with it.
I would appreciate any help or suggestions! I'm totally new to this.
import csv
from operator import itemgetter

fileLineList = []
averageFruitsDict = {}  # Creating an empty dictionary here.

with open('Fruits.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
            average = total / 3
    averageFruitsDict[row[0]] = [highest, lowest, round(average)]

averageFruitsList = []
for key, value in averageFruitsDict.items():
    averageFruitsList.append([key, value[2]])

print('\nFruits in Baskets\n')
print(averageFruitsList)
So I'm now trying with this code:
import pandas as pd
fruits = pd.read_csv('fruits.csv', sep=';')
print(list(fruits.columns))
fruits['Unnamed: 0'].fillna(method='ffill', inplace = True)
fruits.groupby('Unnamed: 0').mean()
fruits.groupby('Bananas').mean()
fruits.groupby('Apples').mean()
fruits.groupby('Oranges').mean()
fruits.to_csv('results.csv', index=False)
It creates a new CSV file for me and it looks correct; I don't get any errors, but I can't make it calculate the mean of every fruit for every basket. Thankful for any help!
So, using the image you posted and replicating an identical test CSV called fruit.csv, I was able to create this quick solution using pandas.
import pandas as pd
fruit = pd.read_csv('fruit.csv')
With the unnamed column containing the basket numbers with NaNs in between, we fill each gap with the preceding value. By doing so we can group by the basket number (the 'Unnamed: 0' column) and apply the mean to all other columns:
fruit['Unnamed: 0'].fillna(method='ffill', inplace = True)
fruit.groupby('Unnamed: 0').mean()
This gets you your desired output of a fruit average for each basket (please note I made up the values for basket 3).
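To get those means into results.csv (rather than re-saving the raw frame, as in the attempt above), assign the grouped result before writing it out; a small sketch:
basket_means = fruit.groupby('Unnamed: 0').mean()
basket_means.to_csv('results.csv')  # keep the index so the basket labels survive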
I was struggling with how to word the question, so I will provide an example of what I am trying to do below. I have a dataframe that looks like this:
ID CODE COST
0 60086 V2401 105.38
1 60142 V2500 221.58
2 60086 V2500 105.38
3 60134 V2750 35
4 60134 V2020 0
I am trying to create a dataframe that has the ID as rows, the CODE as columns, and the COST as values, since the cost for the same code differs per ID. How can I do this?
This seems like a classic "long to wide" problem, and there are several ways to do it. You can try pivot_table, for example:
df.pivot_table(index='ID', columns='CODE', values='COST')
(assuming that the dataframe is df.)
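With the sample data above, the result would look like this (a plain pivot also works here, since each (ID, CODE) pair occurs only once):
wide = df.pivot(index='ID', columns='CODE', values='COST')
print(wide)
# CODE   V2020   V2401   V2500  V2750
# ID
# 60086    NaN  105.38  105.38    NaN
# 60134    0.0     NaN     NaN   35.0
# 60142    NaN     NaN  221.58    NaN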
I have a dataframe with three columns
The first column has 3 unique values. I used the code below to create a separate dataframe per value; however, I am not sure how to iterate over those dataframes afterwards.
import pandas as pd

df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())  # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
Output:
['df0', 'df1', 'df2']
For example, let's say I want to find out the length of the first unique dataframe. If I manually type the name of the dataframe, I get the correct output:
len(df0)
Output:
35
But I am trying to automate the code, so I want to find the length and iterate over the dataframe programmatically, just as I would by typing its name. What I'm looking for is this: if I try the code below,
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the dataframe has more than two columns, where the key would be the unique group and the value contains the other columns of the same row.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group   Description                          Document
Group A            Text to be updated on the ticket 1   doc1.pdf
Group B            Text to be updated on the ticket 2   doc2.pdf
Group A            Text to be updated on the ticket 3   doc3.pdf
Group B            Text to be updated on the ticket 4   doc4.pdf
Group A            Text to be updated on the ticket 5   doc5.pdf
Group B            Text to be updated on the ticket 6   doc6.pdf
Group C            Text to be updated on the ticket 7   doc7.pdf
Group C            Text to be updated on the ticket 8   doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; however, a unique task has to be created for each description. We can club 10 tasks together and submit them as one request, so if I divide the dataframe into separate dataframes based on Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001; within that request there will be multiple sub-tasks such as STASK001, STASK002, ... STASK010.
Hope this helps.
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on the remaining columns, like getting the mean value of a price column for each group if you had one.
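If the end goal is submitting tickets in batches of 10 per group, you can also iterate the df_dict you already built, with no globals() or eval needed; a sketch along those lines:
for name, group_df in df_dict.items():
    print(name, len(group_df))  # the automated equivalent of len(df0)
    # club up to 10 tasks into one request per batch
    for start in range(0, len(group_df), 10):
        batch = group_df.iloc[start:start + 10]
        # ... create one request (e.g. REQUEST001) with one sub-task per row of batch ...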
I think you want to try something like len(eval('df%s' % 0))