Let's suppose I have data with the following structure:
(year, country, region, values)
Example:
Year, Country, Region, Values
2010 A 1 [1,2,3,...(1000 values)]
2010 A 2 [1,2,3,...(1000 values)]
...
2014 J 5 [1,2,3,...(1000 values)]
There are 5 years, 10 countries with 5 regions each and 1000 values for every combination of year, country, region.
I want to know how to decide if I should use multi-rows or multi-columns to store this kind of data. What are the main differences, if any? What are the advantages of each approach?
There are many possible ways to store this data, for example:
1. Multi-row (country, region), single column (year) and an array of values
2. Multi-column (year, country, region) and a single value per row
3. Multi-row (country, region), multi-column (year, index of value)
4. Single row, with one column for year, another for country, another for region and another for the array of values
Option 3 seems to be very bad, because there will be 5 years × 1000 values = 5000 columns.
Option 4 also seems to be very bad, because I would need a group by every time I need something.
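To make option 2 concrete, a minimal sketch of that layout in pandas (the column names and values here are made up):

import pandas as pd

# Hypothetical long-format frame: one row per (year, country, region, value index).
long_df = pd.DataFrame({
    'year':    [2010, 2010, 2010, 2010],
    'country': ['A', 'A', 'A', 'A'],
    'region':  [1, 1, 2, 2],
    'idx':     [0, 1, 0, 1],    # position within the 1000-value series
    'value':   [1, 2, 1, 2],
})

# Aggregations then become simple group-bys, e.g. the mean value per year and country:
print(long_df.groupby(['year', 'country'])['value'].mean())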
You should look into "Tidy Data", which attempts to be a standard for organizing data values within a dataset.
Principles of Tidy Data
1. Columns represent separate variables
2. Rows represent individual observations
3. Observational units form individual DataFrames.
Based on what you are saying, it seems like multi-column might be the way to go, and possibly several sets of data.
It depends on what you want to do, but I would go for multi-row, as pandas is built for handling columnar data. The long data format also seems to be preferred in general: a quick Google search on 'long' and 'wide' data yields many results on converting wide to long, but few on the other way around.
This blog post also points out some of the advantages of long over wide data format.
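As a minimal sketch of that wide-to-long direction (column names made up), pandas' melt reshapes per-year columns into rows:

import pandas as pd

# Hypothetical wide frame: one column per year.
wide = pd.DataFrame({
    'country': ['A', 'B'],
    'region':  [1, 1],
    '2010':    [10, 20],
    '2011':    [11, 21],
})

# Wide -> long: each (country, region, year) combination becomes its own row.
long = wide.melt(id_vars=['country', 'region'], var_name='year', value_name='value')
print(long)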
Related
I have a pandas data frame as below:
The demand_served column is the column I wanted to edit. For now, I just assign the same value for all the rows. My first objective here is to assign the average value to different facilities (the first column). For example, if facility A covers 10 GEOID, all these 10 rows should have 1836.988/10 = 183.6988; if facility B covers 20 GEOID, all these 20 rows share 1836.988/20 = 91.8494.
Moreover, if the same GEOID is covered by 2 facilities, for example, the above facility A and B, it should have demand_served 183.6988+91.8494 = 275.5482.
I am pretty stuck on this problem and cannot come up with any useful thoughts. Any ideas?
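One possible sketch of the computation described above (the column names facility, GEOID, demand_served and the 1836.988 total come from the question; the rows here are made up): divide the total by each facility's GEOID count, then sum those shares per GEOID.

import pandas as pd

TOTAL = 1836.988   # the value currently assigned to every row

# Made-up coverage pairs: facility A covers 10 GEOIDs, facility B covers 20.
df = pd.DataFrame({
    'facility': ['A'] * 10 + ['B'] * 20,
    'GEOID':    list(range(10)) + list(range(20)),
})

# Share contributed by each facility = total / number of GEOIDs it covers.
df['share'] = TOTAL / df.groupby('facility')['GEOID'].transform('count')

# demand_served for a GEOID = sum of the shares of every facility covering it.
df['demand_served'] = df.groupby('GEOID')['share'].transform('sum')
print(df.head())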
I have a dataset on health indicators, with columns such as 'Country', 'Year', 'GDP', and 'Life expectancy'. The data covers the years 2000-2015.
So, there is data for many health indicators for each country for each of the years from 2000-2015.
Many of the variables have missing (NaN) data for specific years/countries.
So, for instance, how would I replace NaN values with average/mean values specific to the given country/year range, for all countries?
Additionally, since this is longitudinal data, it would be great to maintain the general trend over time within each country's 16 years of data. Is there a way to replace NaN data for each country, accounting for the general trend for that country/variable over time?
If you guys could explain both methods, that would be phenomenal.
link to data: https://www.kaggle.com/kumarajarshi/life-expectancy-who
Thanks,
D
screenshot of data
You probably want to look into the pd.DataFrame.interpolate() method. It has different methods for filling NaNs in a time series and for filling in missing values more generally.
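A rough sketch of both ideas, assuming the columns are named Country, Year, GDP and Life expectancy (adjust the file name and headers to match the actual Kaggle CSV):

import pandas as pd

df = pd.read_csv('life_expectancy.csv')   # hypothetical file name
num_cols = ['GDP', 'Life expectancy']      # adjust to the real column names

# 1) Replace NaNs with the per-country mean of each column.
df[num_cols] = df.groupby('Country')[num_cols].transform(lambda s: s.fillna(s.mean()))

# 2) Alternative that respects the trend over time: interpolate within each country along Year.
df = df.sort_values(['Country', 'Year'])
df[num_cols] = df.groupby('Country')[num_cols].transform(
    lambda s: s.interpolate(limit_direction='both'))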
Dataset
I have a movie dataset with over half a million rows, which looks like the following (with made-up numbers):
MovieName Date Rating Revenue
A 2019-01-15 3 3.4 million
B 2019-02-03 3 1.2 million
... ... ... ...
Objective
Select movies that are released "close enough" in terms of date (for example, the release date difference between movie A and movie B is less than a month) and see how, when the rating is the same, the revenue differs.
Question
I know I could write a double loop to achieve this goal. However, I doubt this is the right/efficient way to do it, because:
Some posts (see the comment by cs95 on the question) suggest that iterating over a dataframe is an "anti-pattern" and therefore not advisable.
The dataset has over half a million rows, so I am not sure a double loop would be efficient.
Could someone provide pointers to the question I have? Thank you in advance.
In general, it is true that you should try to avoid loops when working with pandas. My idea is not ideal, but it might point you in the right direction:
Retrieve the month and year from the date column of every row to create new columns "month" and "year". You can see how to do it here.
Afterwards, group your dataframe by month and year (grouped_df = df.groupby(by=["month", "year"])); the resulting groups are dataframes with movies from the same month and year. Now it's up to you what further analysis you want to perform, for example the mean (df.groupby(by=["month", "year"]).mean()), the standard deviation, or something more fancy with the apply() function.
You can also extract weeks if you want a period shorter than a month.
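A minimal sketch of that approach, reusing the made-up rows from the question (dt.to_period is one way to get the month/year bucket in a single column):

import pandas as pd

df = pd.DataFrame({
    'MovieName': ['A', 'B'],
    'Date': pd.to_datetime(['2019-01-15', '2019-02-03']),
    'Rating': [3, 3],
    'Revenue': [3.4e6, 1.2e6],
})

# Bucket by calendar month and rating, then compare revenue within each bucket.
df['month'] = df['Date'].dt.to_period('M')          # use 'W' for weekly buckets
summary = df.groupby(['month', 'Rating'])['Revenue'].agg(['mean', 'std', 'count'])
print(summary)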
I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==family)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel, and they are correct; the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new values.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in pandas (and even NumPy), unlike general-purpose Python, analysts should avoid using for loops, as there are many vectorized options for running conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows), or, as the docs indicate, values broadcast to match the shape of the input array.
Currently, your code is attempting to assign a value to a subsetted slice of a DataFrame column, which should raise a SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping, since transform works nicely for assigning new data frame columns. Also, div below is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
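As a quick sanity check (assuming there is exactly one row per SKU within each Family/Month, as described in the question), the new column should sum to 1 within each Family/Month group:

# Should print (approximately) 1.0 for every Family/Month combination.
print(testingAgain.groupby(['Family', 'Month'])['ProporcionVenta'].sum())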
I'm new to Python and I would appreciate it if you could give me an answer as soon as possible.
I'm processing a file containing reviews for products that can belong to more than 1 category. What I need is to group the review ratings by the categories, and date at the same time. Since I don't know the exact number of categories, or dates in advance, I need to add rows and columns as I'm processing the reviews data (50 GB file).
I've seen how I can add columns, however my trouble is adding a row without knowing how many columns are currently in the dataframe.
Here is my code:
list1=['Movies & TV', 'Books'] #categories so far
dfMain=pandas.DataFrame(index=list1,columns=['2002-09']) #only one column at the beginning
print(dfMain)
This is what dfMain looks like:
If I want to add a column, I simply do this:
dfMain.insert(0, date, 0) #where date is in format like '2002-09'
But what if I want to add a new category (row) and fill all the dates (columns) with zeros? How do I do that? I've tried the append method, but it asks for all the columns as parameters. The insert method doesn't seem to work either.
Here's a possible solution:
dfMain.append(pd.Series(index=dfMain.columns, name='NewRow').fillna(0))
2002-09
Movies & TV NaN
Books NaN
NewRow 0.0
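Note that DataFrame.append was later deprecated (and removed in pandas 2.0); if append is unavailable, a concat-based sketch of the same idea should work:

import pandas as pd

# Build a one-row frame of zeros with the existing columns, then concatenate it.
new_row = pd.DataFrame(0, index=['NewRow'], columns=dfMain.columns)
dfMain = pd.concat([dfMain, new_row])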