I have a problem where I have CSV data like this:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth; Supermarket Product 0-1 years
36-50 Social Media; Word of mouth 1-2 years
18-24 Advertisement +4 years
and I am trying to convert the file into this format, either in a Jupyter notebook or in Excel:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth 0-1 years
18-24 Supermarket Product 0-1 years
36-50 Social Media 1-2 years
36-50 Word of mouth 1-2 years
18-24 Advertisement +4 years
Let's say the CSV file is Untitled form.csv and I import the data into a Jupyter notebook:
data = pd.read_csv('Untitled form.csv')
Can anyone tell me how I should do it?
I have tried doing it in Excel using the Data ribbon's Text to Columns, but of course that only separates the data into columns, while what I want is for the data to be separated into rows while still keeping the values from the other columns.
Anyway, I found another, more roundabout way to do it. First I edit the file through PowerSource in Excel and save it to a different file, and then, if a UTF-8 encoding error appears, I just add encoding='cp1252'.
So it would become like this:
import pandas as pd
data_split = pd.read_csv('Untitled form split.csv',
                         skipinitialspace=True,
                         usecols=range(1, 7),
                         encoding='cp1252')
However, if there's a more efficient way, please let me know. Thanks!
I'm not 100% sure about your question since I think it might be two separate issues but hopefully this should fix it.
import pandas as pd

data = pd.read_fwf('Untitled form.csv')
cols = data.columns
data_long = pd.DataFrame(columns=data.columns)
for idx, row in data.iterrows():
    hear_from = row['Where do you hear our company from?'].split(';')
    hear_from_fmt = [x.strip() for x in hear_from]
    n_items = len(hear_from_fmt)
    d = {
        cols[0]: [row[0]] * n_items,
        cols[1]: hear_from_fmt,
        cols[2]: [row[2]] * n_items,
    }
    data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)
Let's break it down.
The line data = pd.read_fwf('Untitled form.csv') reads the file by inferring the spacing between columns. This is only useful because I am not sure your file is a proper CSV; if it is, you can read it normally with pd.read_csv, and if not, read_fwf might help.
Now for the rest. We iterate through each row and select the ways someone could have heard about your company. These are split on ; and then "stripped" to ensure there are no stray spaces. A new temp dataframe is created where the first and last columns repeat the row's values, but with as many rows as there are elements in the hear_from_fmt list. The dataframes are then concatenated together.
Now there might be a more efficient solution, but this should work.
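One such more concise route, as a sketch, is Series.str.split plus DataFrame.explode (available in pandas >= 0.25); the frame below is rebuilt from the sample rows in the question:

```python
import pandas as pd

# Sample data reconstructed from the question.
data = pd.DataFrame({
    'AgeGroup': ['18-24', '36-50', '18-24'],
    'Where do you hear our company from?': [
        'Word of mouth; Supermarket Product',
        'Social Media; Word of mouth',
        'Advertisement'],
    'How long have you using our platform?': ['0-1 years', '1-2 years', '+4 years'],
})

col = 'Where do you hear our company from?'
# Split each cell on ';' into a list, strip whitespace, then explode
# the lists into one row per item, repeating the other columns.
data[col] = data[col].str.split(';').apply(lambda parts: [p.strip() for p in parts])
data_long = data.explode(col).reset_index(drop=True)
```

This avoids the row-by-row concat entirely and keeps the other columns aligned automatically.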
I have this code.
cheese_sums = []
for year in milk_products.groupby(milk_products['Date']):
    total = milk_products[milk_products['Date'] == year]['Cheddar Cheese Production (Thousand Tonnes)'].sum()
    cheese_sums.append(total)
print(cheese_sums)
I am trying to sum all the Cheddar Cheese Production, which are stored as floats in the milk_products data frame. The Date column is a datetime object that holds only the year, but has 12 values representing each month. As it's written now, I can only print a list of six 0.0's.
I got it. It should be:
cheese_sums = []
for year in milk_products['Date']:
    total = milk_products[milk_products['Date'] == year]['Cheddar Cheese Production (Thousand Tonnes)'].sum()
    if total not in cheese_sums:
        cheese_sums.append(total)
print(cheese_sums)
You seem to be overcomplicating this.
Try groupby(...).sum():
df = milk_products.groupby('Date').sum()
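To see why this works, here is a toy version with made-up numbers (plain integers stand in for the datetime years, and only three "months" per year are shown for brevity):

```python
import pandas as pd

# Made-up frame mimicking the question's layout: several rows per year,
# with a float production value in each.
milk_products = pd.DataFrame({
    'Date': [2019] * 3 + [2020] * 3,
    'Cheddar Cheese Production (Thousand Tonnes)': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One total per year: groupby collects the rows sharing a Date,
# sum() collapses each group to a single value.
totals = milk_products.groupby('Date')['Cheddar Cheese Production (Thousand Tonnes)'].sum()
```

Selecting the one column before summing also sidesteps summing any unrelated columns.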
I have just started out with Pandas and I am trying to do a multilevel sort of data by columns. I have four columns in my data: STNAME, CTYNAME, CENSUS2010POP, SUMLEV. I want to set the index of my data to the columns STNAME, CTYNAME and then sort the data by CENSUS2010POP. After I set the index, the data appears as in pic 1 (before sorting by CENSUS2010POP), and when I sort, the data appears as in pic 2 (after sorting). You can see the indices are messy and no longer sorted serially.
I have read a few posts, including this one (Sorting a multi-index while respecting its index structure), which dates back five years and no longer works as written. I am yet to learn the groupby function.
Could you please tell me a way I can achieve this?
PS: I come from an accounting/finance background and am very new to coding. I have just completed two Python courses, including PY4E.com.
I used the code below to set the index:
census_dfq6 = census_dfq6.set_index(['STNAME','CTYNAME'])
and used the code below to sort the data:
census_dfq6 = census_dfq6.sort_values(by=['CENSUS2010POP'], ascending=False)
Here is a sample of the data I am working with; I would love to share the CSV file, but I don't see a way to do that here.
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
Required End Result:
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
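If the required end result is rows restored to serial index order after the population sort, one sketch (on a hypothetical four-row subset of the data) is to chain sort_index after sort_values; sort_index re-sorts the MultiIndex levels lexicographically:

```python
import pandas as pd

# Hypothetical subset of the census data from the question.
census_dfq6 = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'CTYNAME': ['Autauga County', 'Baldwin County',
                'Aleutians East Borough', 'Anchorage Municipality'],
    'CENSUS2010POP': [54571, 182265, 3141, 291826],
    'SUMLEV': [50, 50, 50, 50],
}).set_index(['STNAME', 'CTYNAME'])

# Sort by population, then restore the serial (STNAME, CTYNAME) order.
result = census_dfq6.sort_values(by='CENSUS2010POP', ascending=False).sort_index()
```

Since the (STNAME, CTYNAME) pairs are unique, the final sort_index fully determines the row order.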
This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I have a small Excel file that contains prices for our online store, and I am trying to automate this process. However, I don't fully trust the staff to properly qualify the data, so I wanted to use pandas to quickly check over certain fields. I have managed to achieve everything I need so far, but I am only a beginner and I cannot think of the proper way to do the next part.
So basically I need to qualify two columns on the same row: we have one column MARGIN, and if this column is > 60, then I need to check that the MARKDOWN column on the same row is populated == YES.
So my question is: how can I code a check that says, in effect, "if MARGIN > 60, then MARKDOWN must be YES"?
Below is an example of the way I have been doing my other checks, I realise it is quite beginner-ish, but I am only a beginner.
import csv

import pandas as pd

sku2 = df['SKU_2']
comp_at = df['COMPARE AT PRICE']
sales_price = df['SALES PRICE']
dni_act = df['DO NOT IMPORT - action']
dni_fur = df['DO NOT IMPORT - further details']
promo = df['PROMO']
replacement = df['REPLACEMENT']
go_live_date = df['go live date']
markdown = df['markdown']

# sales price not blank check
for item in sales_price:
    if pd.isna(item):
        with open('document.csv', 'a', newline="") as fd:
            writer = csv.writer(fd)
            writer.writerow(['It seems there is a blank sales price in here', str(file_name)])
        break
Example:
df = pd.DataFrame([
    ['a', 1, 2],
    ['b', 3, 4],
    ['a', 5, 6]],
    columns=['f1', 'f2', 'f3'])

# & means AND; | means OR
print(df[(df['f1'] == 'a') & (df['f2'] > 1)])
Output:
f1 f2 f3
2 a 5 6
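The same pattern can be applied to the MARGIN/MARKDOWN rule from the question; the frame below is made up for illustration:

```python
import pandas as pd

# Made-up frame; real column values are assumptions.
df = pd.DataFrame({
    'MARGIN': [55, 70, 80],
    'MARKDOWN': ['NO', 'YES', 'NO'],
})

# Rows that violate the rule: MARGIN > 60 but MARKDOWN is not 'YES'.
violations = df[(df['MARGIN'] > 60) & (df['MARKDOWN'] != 'YES')]
```

If violations is empty, every high-margin row is properly marked down; otherwise it lists exactly the rows to flag.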
I have a dataset of users, books, and ratings, and I want to find users who rated a particular book highly, and then, for those users, find what other books they liked too.
My data looks like:
df.sample(5)
User-ID ISBN Book-Rating
49064 102967 0449244741 8
60600 251150 0452264464 9
376698 52853 0373710720 7
454056 224764 0590416413 7
54148 25409 0312421273 9
I did so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
The last line fails with:
KeyError: 'User-ID'
I want to obtain the users who rated LOTR > 7, and for those users, to then find other books they liked from the matrix.
Help would be appreciated. Thanks.
In your like_lotr dataframe, 'User-ID' is the name of the index; you cannot select it like a normal column, which is why the line users = like_lotr['User-ID'] raises a KeyError. It is not a column.
Moreover, ix is deprecated; it is better to use loc in your case. And don't put quotes around the value: it needs to be an integer, since ISBN was read in as a column of integers (at least judging from your sample).
Try like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464] # used another number from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the ids you want.
Using your small sample above and the number I used to test, users is [251150].
An alternative solution is to use reset_index. The last two lines should then look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index puts the index back into the columns.
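Putting the reset_index variant together on the sample rows from the question (with the leading zeros dropped, since the ISBNs are treated as integers here):

```python
import pandas as pd

# The five sample rows from the question, with ISBNs as integers.
df = pd.DataFrame({
    'User-ID': [102967, 251150, 52853, 224764, 25409],
    'ISBN': [449244741, 452264464, 373710720, 590416413, 312421273],
    'Book-Rating': [8, 9, 7, 7, 9],
})

# Books as rows, users as columns, ratings as values (0 where unrated).
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)

# Ratings for one book, then the users who rated it above 7.
lotr = df_p.loc[452264464]
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID'].tolist()
```

From there, df_p[users] gives those users' rating columns, from which their other highly rated books can be read off.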