I have code that writes people's names, ages, and scores for a quiz that I made. I simplified the code so that it writes the names and ages together rather than separately, but I can't write the score with the names, as they are produced in separate parts of the code. The CSV file looks like this:
name, age, score
Alfie, 15, 20
Michael, 16, 19
Alfie, 15, #After I simplified
Dylan, 16,
As you can see, I don't know how to write a value into the third column. Does anyone know how to write a value into the next available cell of the third column (column index 2) in a CSV file? I'm new to programming, so any help would be greatly appreciated.
Michael
This is your data:
df = pd.DataFrame({'name':['Alfie','Michael','Alfie','Dylan'], 'age':[15,16,15,16], 'score':[20,19,None,None]})
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 NaN
3 Dylan 16 NaN
If you need to read the CSV into pandas, then use:
import pandas as pd
df = pd.read_csv('Your_file_name.csv')
I suggest two ways to solve your problem:
df.fillna(0, inplace=True) fills all missing values (this example fills with 0).
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 0.0
3 Dylan 16 0.0
df.loc[2, 'score'] = 22 fills a specific cell.
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 22.0
3 Dylan 16 NaN
If, after that, you need to write your fixed data back to CSV, then use:
df.to_csv('New_name.csv', sep=',', index=False)
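Putting the pieces together, here is a minimal end-to-end sketch, assuming your file is called quiz.csv (a hypothetical name) and that the missing scores are computed elsewhere in your program:
import pandas as pd

# Read the quiz results; skipinitialspace handles the spaces after the commas,
# and the missing scores come in as NaN.
df = pd.read_csv('quiz.csv', skipinitialspace=True)  # 'quiz.csv' is a hypothetical name

# Fill in the missing scores, e.g. with values computed elsewhere in your program.
df.loc[2, 'score'] = 22   # hypothetical score for the second Alfie row
df.loc[3, 'score'] = 18   # hypothetical score for Dylan

# Write the completed table back without the pandas row index.
df.to_csv('quiz.csv', index=False)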
Update: I've been messing around with the two files. I copied the data in the DOB column from the 2nd file to the 1st file to make the files look visually identical. However, I noticed some really interesting behaviour when using Ctrl+F in Microsoft Excel: when I leave the search box blank in the first file, it finds no matches, but when I repeat the same operation in the 2nd file, it finds 21 matches, one in each cell from E1 to G7. I suppose there are somehow some blank/invisible cells in the 2nd file, and that's what's causing read_excel to behave differently.
My goal is simply to execute the pandas read_excel function. However, I'm running into a very strange situation: I'm running pandas.read_excel on two very similar Excel files but getting substantially different results.
Code
import pandas
data1 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_1.xlsx')
print(data1)
data2 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_2.xlsx')
print(data2)
Output
Name DOB Class Year GPA
0 Redeye438 2008-09-22 Fresh 1
1 Redeye439 2009-09-20 Soph 2
2 Redeye440 2010-09-22 Junior 3
3 Redeye441 2011-09-20 Senior 4
4 Redeye442 2008-09-20 Fresh 4
5 Redeye443 2009-09-22 Soph 3
Name DOB Class Year GPA
Redeye438 2011-09-20 Fresh 1 NaN NaN NaN
Redeye439 2010-09-22 Soph 2 NaN NaN NaN
Redeye440 2009-09-20 Junior 3 NaN NaN NaN
Redeye441 2008-09-22 Senior 4 NaN NaN NaN
Redeye442 2011-09-22 Fresh 4 NaN NaN NaN
Redeye443 2010-09-20 Soph 3 NaN NaN NaN
Why are the columns mapped incorrectly for data2?
The Excel files in question (the only difference is the data in the DOB column):
Excel file downloads
It looks like you're using an older version of Pandas because I can't reproduce the issue on the latest version.
import pandas as pd
pd.show_versions()
INSTALLED VERSIONS
------------------
python : 3.10.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
pandas : 1.5.0
As you can see below, the columns of data2 are mapped correctly.
import pandas as pd
data1 = pd.read_excel(r"C:\Users\abokey\Downloads\test_1.xlsx")
print(data1)
data2 = pd.read_excel(r"C:\Users\abokey\Downloads\test_2.xlsx")
print(data2)
Name DOB Class Year GPA
0 Redeye438 2008-09-22 Fresh 1
1 Redeye439 2009-09-20 Soph 2
2 Redeye440 2010-09-22 Junior 3
3 Redeye441 2011-09-20 Senior 4
4 Redeye442 2008-09-20 Fresh 4
5 Redeye443 2009-09-22 Soph 3
Name DOB Class Year GPA
0 Redeye438 2011-09-20 Fresh 1
1 Redeye439 2010-09-22 Soph 2
2 Redeye440 2009-09-20 Junior 3
3 Redeye441 2008-09-22 Senior 4
4 Redeye442 2011-09-22 Fresh 4
5 Redeye443 2010-09-20 Soph 3
However, you're right that the two Excel files do not have the same format. In fact, looking at test_2.xlsx more carefully, it seems to carry a 7x3 block of blank cells (rows x columns). The latest version of pandas handles this kind of Excel file, since the empty cells are ignored when calling pandas.read_excel.
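If you want to see those phantom cells yourself, here is a small diagnostic sketch using openpyxl (assuming the file sits in your working directory); the sheet's recorded dimensions will include the blank-but-touched cells:
from openpyxl import load_workbook

wb = load_workbook('test_2.xlsx')  # assumes the file is in the working directory
ws = wb.active
# The recorded dimensions include blank-but-touched cells, so you may see
# something like 'A1:G7' here instead of the expected 'A1:D7'.
print(ws.dimensions, ws.max_row, ws.max_column)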
So in principle, upgrading your pandas version should fix the problem:
pip install --upgrade pandas
If the problem persists, clean your Excel file test_2.xlsx like this:
Open up the file in Excel
Click on Find & Select
Choose Go To Special and Blanks
Click on Clear All
Save your changes
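Alternatively, if upgrading isn't an option, a workaround sketch is to tell read_excel exactly which columns to parse, so the phantom cells never enter the frame (this assumes the real data lives in columns A through D):
import pandas as pd

# Restrict parsing to the real data columns so the blank-but-touched
# cells in E:G are never read (assumes the data occupies A to D).
data2 = pd.read_excel('test_2.xlsx', usecols='A:D')
print(data2)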
My data has missing values for 'Age' and I want to replace them with the average of their group, based on a groupby on column 'Title'. After the command:
df.groupby('Title').mean()['Age']
I get a Series, for example:
Mr 32
Miss 21.7
Ms 28
etc.
I tried:
df['Age'].replace(np.nan, 0, inplace=True)
df[(df.Age==0.0)&(df.Title=='Mr')]
to see just the rows where age is missing and the title is of one type, but it doesn't work.
Question 1: Why does the code above not show any rows, despite multiple rows satisfying both conditions at the same time (age == 0.0 and title == 'Mr')?
Question 2: How can I replace all missing values with the group average as described above?
I cannot reproduce the first error, so I'll use an example like the one below:
import pandas as pd
import numpy as np
np.random.seed(111)
df = pd.DataFrame({'Title':np.random.choice(['Mr','Miss','Mrs'],20),'Age':np.random.randint(20,50,20)})
df.loc[[5,9,10,11,12],['Age']]=np.nan
the data frame looks like:
Title Age
0 Mr 42.0
1 Mr 28.0
2 Mr 25.0
3 Mr 32.0
4 Mrs 26.0
5 Miss NaN
6 Mrs 32.0
7 Mrs 33.0
8 Mrs 25.0
9 Mr NaN
10 Miss NaN
11 Mr NaN
12 Mrs NaN
13 Miss 38.0
14 Mr 31.0
15 Mr 42.0
16 Mr 24.0
17 Mrs 23.0
18 Mrs 49.0
19 Miss 27.0
And we can fill the missing ages with just one more step:
ave_age = df.groupby('Title')['Age'].mean()
df.loc[pd.isna(df['Age']),'Age'] = ave_age[df.loc[pd.isna(df['Age']),'Title']].values
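To verify, continuing from the snippet above, the rows we blanked out now hold their group means:
# Quick check: rows 5, 9, 10, 11, 12 were set to NaN earlier,
# so each should now show the mean age of its Title group.
print(df.loc[[5, 9, 10, 11, 12]])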
Question 1:
Please provide a snippet so that the error can be reproduced.
Question 2:
Try df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean')). This is similar to Pandas: filling missing values by mean in each group.
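For completeness, a minimal self-contained sketch of the transform approach, built on the example frame from the first answer (note that the result has to be assigned back, since fillna returns a new Series):
import pandas as pd
import numpy as np

np.random.seed(111)
df = pd.DataFrame({'Title': np.random.choice(['Mr', 'Miss', 'Mrs'], 20),
                   'Age': np.random.randint(20, 50, 20).astype(float)})
df.loc[[5, 9, 10, 11, 12], 'Age'] = np.nan

# transform('mean') broadcasts each group's mean back onto the original rows,
# so fillna can align it row by row.
df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean'))
print(df)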
I am facing a weird scenario.
I have a data frame holding up to the 3 largest scores per unique id, like this:
id  rid  code  score
1   9    67    43
1   8    87    22
1   4    32    20
2   3    56    43
3   10   22    100
3   5    67    50
Here the same id can appear on several rows, but each row is distinct.
I want to make my data frame like this:
id  first_code  second_code  third_code
1   67          87           32
2   56          none         none
3   22          67           none
I have already built my dataframe so that it shows the top 3 scores per id; if there are fewer than 3 values, I take the top 2 or the single value available. Depending on the score, I want to rearrange the code column into three different columns: first_code holds the code with the highest score, second_code the second highest, and third_code the third highest. If a value is not found, I want to leave it blank.
Kindly help me to solve this.
Use GroupBy.cumcount for a counter, create a MultiIndex, and reshape with Series.unstack:
df = df.set_index(['id',df.groupby('id').cumcount()])['code'].unstack()
df.columns=['first_code', 'second_code', 'third_code']
df = df.reset_index()
print (df)
   id  first_code  second_code  third_code
0   1        67.0         87.0        32.0
1   2        56.0          NaN         NaN
2   3        22.0         67.0         NaN
By the way, cumcount should also be used in the earlier step that filters the top 3 values per id, as in the sketch below.
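A minimal sketch of that filtering step, using the question's data plus one extra hypothetical row (rid 2) so the filter has something to drop:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 3, 3],
                   'rid': [9, 8, 4, 2, 3, 10, 5],
                   'code': [67, 87, 32, 11, 56, 22, 67],
                   'score': [43, 22, 20, 5, 43, 100, 50]})

# Sort so the highest score comes first within each id,
# then keep at most the first 3 rows per id.
df = df.sort_values(['id', 'score'], ascending=[True, False])
df = df[df.groupby('id').cumcount() < 3]
print(df)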
I have a DataFrame that looks like this:
user data
0 Kevin 1
1 Kevin 3
2 Sara 5
3 Kevin 23
...
And I want to get the historical values (looking, let's say, 2 entries forward) as additional columns:
user data data_1 data_2
0 Kevin 1 3 23
1 Sara 5 24 NaN
2 Kim ...
...
Right now I'm able to do this through the following command:
_temp = df.groupby(['user'], as_index=False)['data']
for i in range(1, 3):
    df['data_{0}'.format(i)] = _temp.shift(-i)
I feel like my approach is very inefficient and that there is a much faster way to do this (especially when the number of lookahead/lookback values goes up)!
You can use groupby.cumcount() with set_index() and unstack():
m=df.assign(k=df.groupby('user').cumcount().astype(str)).set_index(['user','k']).unstack()
m.columns=m.columns.map('_'.join)
print(m)
data_0 data_1 data_2
user
Kevin 1.0 3.0 23.0
Sara 5.0 NaN NaN
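If you also want the exact column names from your expected output (data, data_1, data_2) with user as a regular column, a small follow-up sketch:
import pandas as pd

df = pd.DataFrame({'user': ['Kevin', 'Kevin', 'Sara', 'Kevin'],
                   'data': [1, 3, 5, 23]})

m = df.assign(k=df.groupby('user').cumcount().astype(str)).set_index(['user', 'k']).unstack()
m.columns = m.columns.map('_'.join)
# Match the asked-for layout: first history column named 'data', user as a column.
m = m.rename(columns={'data_0': 'data'}).reset_index()
print(m)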
I have a dataframe with recordings of statistics in multiple columns.
I have a list of the column names: stat_columns = ['Height', 'Speed'].
I want to combine the data to get one row per id.
The data comes sorted with the newest records on top. I want the most recent data, so I must use the first value of each column, per id.
My dataframe looks like this:
Index  id      Height  Speed
0      100007          8.3
1      100007  54
2      100007          8.6
3      100007  52
4      100035  39
5      100014  44
6      100035          5.6
And I want it to look like this:
Index  id      Height  Speed
0      100007  54      8.3
1      100014  44
2      100035  39      5.6
I have tried a simple groupby myself:
df_stats = df_path.groupby(['id'], as_index=False).first()
But this seems to only give me a row with the first statistic found.
Your solution works for me; maybe it is necessary to replace the empty values with NaN first, since groupby.first skips NaN but treats an empty string as a regular value:
import numpy as np

df_stats = df_path.replace('', np.nan).groupby('id', as_index=False).first()
print (df_stats)
id Index Height Speed
0 100007 0 54.0 8.3
1 100014 5 44.0 NaN
2 100035 4 39.0 5.6
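For reference, a self-contained sketch of the failure mode, assuming the blanks in your file really are empty strings rather than NaN:
import numpy as np
import pandas as pd

df_path = pd.DataFrame({'id': [100007, 100007, 100014, 100035],
                        'Height': ['', '54', '44', '39'],
                        'Speed': ['8.3', '', '', '5.6']})

# Without the replace, first() happily returns the empty strings:
print(df_path.groupby('id', as_index=False).first())

# With empty strings converted to NaN, first() skips them and
# picks the first real value per column instead:
print(df_path.replace('', np.nan).groupby('id', as_index=False).first())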