Pandas read_csv adds unnecessary " " to each row

Pandas read_csv adds unnecessary " " to each row - python

I have a csv file
(I am showing the first three rows here)
HEIGHT,WEIGHT,AGE,GENDER,SMOKES,ALCOHOL,EXERCISE,TRT,PULSE1,PULSE2,YEAR
173,57,18,2,2,1,2,2,86,88,93
179,58,19,2,2,1,2,1,82,150,93
I am using pandas read_csv to read the file and put them into columns.
Here is my code:
import pandas as pd
import os
path='~/Desktop/pulse.csv'
path=os.path.expanduser(path)
my_data=pd.read_csv(path, index_col=False, header=None, quoting = 3, delimiter=',')
print my_data
The problem is the first and last columns have " before and after the values.
Additionally I can't get rid of the indexes.
It might be making some silly mistake but I thank you for your help in advance

Final solution - use replace with converting to ints and for remove " from columns names use strip:
df = pd.read_csv('pulse.csv', quoting=3)
df = df.replace('"','', regex=True).astype(int)
df.columns = df.columns.str.strip('"')
print (df.head())
HEIGHT WEIGHT AGE GENDER SMOKES ALCOHOL EXERCISE TRT PULSE1 \
0 173 57 18 2 2 1 2 2 86
1 179 58 19 2 2 1 2 1 82
2 167 62 18 2 2 1 1 1 96
3 195 84 18 1 2 1 1 2 71
4 173 64 18 2 2 1 3 2 90
PULSE2 YEAR
0 88 93
1 150 93
2 176 93
3 73 93
4 88 93
index_col=False means force not read first column to index, but dataframe always need some index, so is added default - 0,1,2.... So here can be omit.
header=None should be removed because it force dont read first row (header of csv) to columns of DataFrame. Then also first row of data is header and numeric values are converted to strings.
delimiter=',' should be removed too, because it is same as sep=',' what is default parameter.

#jezrael is right - a pandas dataframe will always add indices. It's necessary.
try something like df[0] = df[0].str.strip() replacing zero with the last column.
before you do so, convert your csv to a dataframe - pd.DataFrame.from_csv(path)

Related

how to combine the first 2 column in pandas/python with n/a value

I have some questions about combining the first 2 columns in pandas/python with n/a value
long story: I need to read an excel and alter those changes. I can not change anything in excel, so any change has been done by python.
Here is the excel input
and the expected expect output will be
I manage to read it in, but when I try to combine the first 2 columns, I have some problems. since in excel, the first row is merged, so once it is read in. only one row has value, but the rest of the row is all N/A.
such as below:
Year number 2016
Month Jan
Month 2016-01
Grade 1 100
NaN 2 99
NaN 3 98
NaN 4 96
NaN 5 92
NaN Total 485
Is there any function that can easily help me to combine the first two columns and make it as below:
Year 2016
Month Jan
Month 2016-01
Grade 1 100
Grade 2 99
Grade 3 98
Grade 4 96
Grade 5 92
Grade Total 485
Anything will be really appreciated.
I searched and google the key word for so long but did not find any answer that fits my situation here.

d = '''
Year,number,2016
Month,,Jan
Month,,2016-01
Grade,1, 100
NaN,2, 99
NaN,3, 98
NaN,4, 96
NaN,5, 92
NaN,Total,485
'''
df = pd.read_csv(StringIO(d))
df
df['Year'] = df.Year.fillna(method='ffill')
df = df.fillna('') # skip this step if your data from excel does not have nan in col 2.
df['Year'] = df.Year + ' ' + df.number.astype('str')
df = df.drop('number',axis=1)
df

How to group by a df in Python by a column with the difference between the max value of one column and the min of another column?

I have a data frame which looks like this:
student_id
session_id
reading_level_id
st_week
end_week
1
3334
3
3
3
1
3335
2
4
4
2
3335
2
2
2
2
3336
2
2
3
2
3337
2
3
3
2
3339
2
3
4
...
There are multiple session_id's, st_weeks and end_weeks for every student_id. Im trying to group the data by 'student_id' and I want to calculate the difference between the maximum(end_week) and the minimum (st_week) for each student.
Aiming for an output that would look something like this:
Student_id
Diff
1
1
2
2
....
I am relatively new to Python as well as Stack Overflow and have been trying to find an appropriate solution - any help is appreciated.

Using the data you shared, a simpler solution is possible:
Group by student_id, and pass False argument to the as_index parameter (this works for a dataframe, and returns a dataframe);
Next, use a named aggregation to get the `max week for end week and the min week for st_week for each group
Get the difference between max_wk and end_wk
Finally, keep only the required columns
(
df.groupby("student_id", as_index=False)
.agg(max_wk=("end_week", "max"), min_wk=("st_week", "min"))
.assign(Diff=lambda x: x["max_wk"] - x["min_wk"])
.loc[:, ["student_id", "Diff"]]
)
student_id Diff
0 1 1
1 2 2

There's probably a more efficient way to do this, but I broke this into separate steps for the grouping to get max and min values for each id, and then created a new column representing the difference. I used numpy's randint() function in this example because I didn't have access to a sample dataframe.
import pandas as pd
import numpy as np
# generate dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(1200, 4)), columns=['student_id', 'session_id', 'st_week', 'end_week'])
# use groupby to get max and min for each student_id
max_vals = df.groupby(['student_id'], sort=False)['end_week'].max().to_frame()
min_vals = df.groupby(['student_id'], sort=False)['st_week'].min().to_frame()
# use join to put max and min back together in one dataframe
merged = min_vals.join(max_vals)
# use assign() to calculate difference as new column
merged = merged.assign(difference=lambda x: x.end_week - x.st_week).reset_index()
merged
student_id st_week end_week difference
0 40 2 99 97
1 23 5 74 69
2 78 9 93 84
3 11 1 97 96
4 97 24 88 64
... ... ... ... ...
95 54 0 96 96
96 18 0 99 99
97 8 18 97 79
98 75 21 97 76
99 33 14 93 79

You can create a custom function and apply it to a group-by over students:
def week_diff(g):
return g.end_week.max() - g.st_week.min()
df.groupby("student_id").apply(week_diff)
Result:
student_id
1 1
2 2
dtype: int64

Average certain columns based on values in other columns

I would like to average certain column values depending on whether a condition is met in another column. Specifically, if column 1 in the below dataframe is < 1700, I want to include the corresponding value in that row from column 51 in my average calculation. And if column 2 < 1700, I want to also include the value in that row from column 52 in my average calculation.
So, for row 0, the new calculated column for that row would be 64 (average of 65 & 63). For row 1, the average would be just 80 (column 51 value) since neither columns 2 nor 3 were less than 1700 and hence not included in the average calculation.
This is a simplified example as my actual dataframe has about 10 columns for conditions with 10 corresponding columns of values to average.
As a potential complexity, the column headers are numbers rather than traditional text labels and do not refer to the order of that column in the dataframe since I've excluded certain columns when I imported the csv file. In other words, column 51 isn't the 51st column in the dataframe.
When I run the below code I'm getting the following error:
ValueError: ("No axis named 1 for object type ",
'occurred at index 0')
Is there a more efficient way to code this and avoid this error? Thanks for your help!
import pandas as pd
import numpy as np
test_df = pd.DataFrame({1:[1600,1600,1600,1700,1800],2:[1500,2000,1400,1500,2000],
3:[2000,2000,2000,2000,2000],51:[65,80,75,80,75],52:[63,82,85,85,75],53:[83,80,75,76,78]})
test_df
1 2 3 51 52 53
0 1600 1500 2000 65 63 83
1 1600 2000 2000 80 82 80
2 1600 1400 2000 75 85 75
3 1700 1500 2000 80 85 76
4 1800 2000 2000 75 75 78
def calc_mean_based_on_conditions(row):
list_of_columns_to_average = []
for i in range(1,4):
if row[i] < 1700:
list_of_columns_to_average.append(i+50)
if not list_of_columns_to_average:
return np.nan
else:
return row[(list_of_columns_to_average)].mean(axis=1)
test_df['MeanValue'] = test_df.apply(calc_mean_based_on_conditions, axis=1)

Something very relevant (supporting int as column names)- https://github.com/theislab/anndata/issues/31
Due to this bug/issue, I converted the column names to type string:
test_df = pd.DataFrame({'1':[1600,1600,1600,1700,1800],'2':[1500,2000,1400,1500,2000],
'3':[2000,2000,2000,2000,2000],'51':[65,80,75,80,75],'52':[63,82,85,85,75],'53':
[83,80,75,76,78]})
Created a new dataframe - new_df to meet out requirements
new_df = test_df[['1', '2', '3']].where(test_df[['1','2','3']]<1700).notnull()
new_df now looks like this
1 2 3
0 True True False
1 True False False
2 True True False
3 False True False
4 False False False
Then simply rename the column and check using 'where'
new_df = new_df.rename(columns={"1": "51", "2":"52", "3":"53"})
test_df['mean_value'] = test_df[['51', '52', '53']].where(new_df).mean(axis=1)
This should give you the desired output -
1 2 3 51 52 53 mean_value
0 1600 1500 2000 65 63 83 64.0
1 1600 2000 2000 80 82 80 80.0
2 1600 1400 2000 75 85 75 80.0
3 1700 1500 2000 80 85 76 85.0
4 1800 2000 2000 75 75 78 NaN

I deleted my other answer because it was going down the wrong path. What you want to do is generate a mask of your conditional columns, then use that mask to apply a function to other columns. In this case, 1 corresponds to 51, 2 to 52, etc.
import pandas as pd
import numpy as np
test_df = pd.DataFrame({1:[1600,1600,1600,1700,1800],2:[1500,2000,1400,1500,2000],
3:[2000,2000,2000,2000,2000],51:[65,80,75,80,75],52:[63,82,85,85,75],53:[83,80,75,76,78]})
test_df
1 2 3 51 52 53
0 1600 1500 2000 65 63 83
1 1600 2000 2000 80 82 80
2 1600 1400 2000 75 85 75
3 1700 1500 2000 80 85 76
4 1800 2000 2000 75 75 78
# create dictionary to map columns to one another
l1=list(range(1,4))
l2=list(range(50,54))
d = {k:v for k,v in zip(l1,l2)}
d
{1: 51, 2: 52, 3: 53}
temp=test_df[l1] > 1700 # Subset initial dataframe, generate mask
for _, row in temp.iterrows(): #iterate through subsetted data
list_of_columns_for_mean=list() # list of columns for later computation
for k, v in d.items(): #iterate through each k:v and evaluate conditional for each row
if row[k]:
list_of_columns_for_mean.append(v)
# the rest should be pretty easy to figure out
This is not an elegant solution, but it is a solution. Unfortunately, I've run out of time to dedicate to it, but hopefully this gets you pointed in a better direction.

There is probably a better, vectorized way to do this, but you could do it without the function
import numpy as np
import pandas as pd
from collections import defaultdict
test_df = pd.DataFrame({1:[1600,1600,1600,1700,1800],2:[1500,2000,1400,1500,2000],
3:[2000,2000,2000,2000,2000],51:[65,80,75,80,75],52:[63,82,85,85,75],53:[83,80,75,76,78]})
# List of columns that you're applying the condition to
condition_cols = list(range(1,4))
# Get row and column indices where this condition is true
condition = np.where(test_df[condition_cols].lt(1700))
# make a dictionary mapping row to true columns
cond_map = defaultdict(list)
for r,c in zip(*condition):
cond_map[r].append(c)
# Get the means of true columns
means = []
for row in range(len(test_df)):
if row in cond_map:
temp = []
for col in cond_map[row]:
# Needs 51 because of Python indexing starting at zero + 50
temp.append(test_df.loc[row, col+51])
means.append(temp)
else:
# If the row has no true columns (i.e row 4)
means.append(np.nan)
test_df['Means'] = [np.mean(l) for l in means]
The issue is indexing true rows and columns in a vectorized way.

Reading a Specific Row of a CSV File based on the 1st Occurrence of a value within a Colum

Below is the CSV File that I have:
Record Time Value 1 Value 2 Value 3
Event 1 20 35 40
Event 2 48 43 56
Event 3 45 58 90
FFC 4 12 89 94
FFC 5 30 25 60
Event 6 99 45 13
I would like to use pandas in order to parse through the 'Record' Column until I find the first FFC and then print that entire row. Additionally, I would like to print the row that is two above the first found FFC. Any suggestions on how to approach this?
My reasoning for wanting to use Pandas is that I am going to need to call upon specific values within the two printed rows and plot them.
To start I have:
csvfile = pd.read_csv('Test.csv')
print(csvfile)
Thank you very much for your assistance, it is greatly appreciated!

This is one way.
import pandas as pd
from io import StringIO
mystr = StringIO("""Record Time Value1 Value2 Value3
Event 1 20 35 40
Event 2 48 43 56
Event 3 45 58 90
FFC 4 12 89 94
FFC 5 30 25 60
Event 6 99 45 13""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr, delim_whitespace=True)
# get index of condition
idx = df[df['Record'] == 'FFC'].index[0]
# filter for appropriate indices
res1 = df.loc[idx]
res2 = df.loc[idx-2]
To output a dataframe:
print(res1.to_frame().T)
# Record Time Value1 Value2 Value3
# 3 FFC 4 12 89 94

Numpy 2D array in Python 3.4

I have this code:
import pandas as pd
data = pd.read_csv("test.csv", sep=",")
data array looks like that:
The problem is that I can't split it by columns, like that:
week = data[:,1]
It should split the second column into the week, but it doesn't do it:
*TypeError: unhashable type: 'slice'
*
How should I do this to make it work?
I also wondering, that what this code do exactly? (Don't really understand np.newaxis part)
week = data['1'][:, np.newaxis]
Result:

There are a few issues here.
First, read_csv uses a comma as a separator by default, so you don't need to specify that.
Second, the pandas csv reader by default uses the first row to get column headings. That doesn't appear to be what you want, so you need to use the header=None argument.
Third, it looks like your first column is the row number. You can use index_col=0 to use that column as the index.
Fourth, for pandas, the first index is the column, not the row. Further, using the standard data[ind] notation is indexing by column name, rather than column number. And you can't use a comma to index both row and column at the same time (you need to use data.loc[row, col] to do that).
So for your case, all you need to do to get the second columns is data[2], or if you use the first column as the row number then the second column becomes the first column, so you would do data[1]. This returns a pandas Series, which is the 1D equivalent of a 2D DataFrame.
So the whole thing should look like this:
import pandas as pd
data = pd.read_csv('test.csv', header=None, index_col=0)
week = data[1]
data looks like this:
1 2 3 4
0
1 10 2 100 12
2 15 5 150 15
3 25 7 240 20
4 22 12 350 20
5 51 13 552 20
6 134 20 880 36
7 150 22 900 38
8 200 29 1020 44
9 212 31 1100 46
10 199 23 1089 45
11 220 32 1145 60
The '0' row doesn't exist, it is just there for informational purposes.
week looks like this:
0
1 10
2 15
3 25
4 22
5 51
6 134
7 150
8 200
9 212
10 199
11 220
Name: 1, dtype: int64
However, you can give columns (and rows) meaningful names in pandas, and then access them by those names. I don't know the column names, so I just made some up:
import pandas as pd
data = pd.read_csv('test.csv', header=None, index_col=0, names=['week', 'spam', 'eggs', 'grail'])
week = data['week']
In this case, data looks like this:
week spam eggs grail
1 10 2 100 12
2 15 5 150 15
3 25 7 240 20
4 33 12 350 20
5 51 13 552 20
6 134 20 880 36
7 150 22 900 38
8 200 29 1020 44
9 212 31 1100 46
10 199 23 1089 45
11 220 32 1145 50
And week looks like this:
1 10
2 15
3 25
4 33
5 51
6 134
7 150
8 200
9 212
10 199
11 220
Name: week, dtype: int64
For np.newaxis, what that does is add one dimension to the array. So say you have a 1D array (a vector), using np.newaxis on it would turn it into a 2D array. It would turn a 2D array into a 3D array, 3D into 4D, and so on. Depending on where you put it (such as [:,np.newaxis] vs. [np.newaxis,:], you can determine which dimension to add. So np.arange(10)[np.newaxis,:] (or just np.arange(10)[np.newaxis]) gives you a shape (1,10) 2D array, while np.arange(10)[:,np.newaxis] gives you a shape (10,1) 2D array.
In your case, what the line is doing is getting the column with the name 1, which is a 1D pandas Series, then adding a new dimension to it. However, instead of turning it back into a DataFrame, it instead converts it into a 1D numpy array, then adds one dimension to make it a 2D numpy array.
This, however, is dangerous long-term. There is no guarantee that this sort of silent conversion won't be changed at some point. To change a pandas objects to a numpy one, you should use an explicit conversion with the values method, so in your cases data.values or data['1'].values.
However, you don't really need a numpy array. A series is fine. If you really want a 2D object, you can convert a Series into a DataFrame by using something like data['1'].to_frame().

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas read_csv adds unnecessary " " to each row - python

#jezrael is right - a pandas dataframe will always add indices. It's necessary. try something like df[0] = df[0].str.strip() replacing zero with the last column. before you do so, convert your csv to a dataframe - pd.DataFrame.from_csv(path)

Related

how to combine the first 2 column in pandas/python with n/a value

How to group by a df in Python by a column with the difference between the max value of one column and the min of another column?

Average certain columns based on values in other columns

Reading a Specific Row of a CSV File based on the 1st Occurrence of a value within a Colum

Numpy 2D array in Python 3.4

Categories

Resources