Pandas Dataframe reads value of one column wrong - python

The csv file:
Link to Github
This is my code:
import pandas as pd
df = pd.read_csv("log_1_2018_09_07.csv", encoding="ISO-8859-1", delimiter=';')
print(df.columns.tolist())
dates = []
times = []
outputs = []
for date in df.loc[:, "Datum"]:
    dates.append(date)
    print("date")
    print(date)
for time in df.loc[:, " Zeit"]:
    times.append(time)
    print("time")
    print(time)
for out in df.iloc[:, 19]:
    print("output")
    outputs.append(out)
    print(out)
It reads the dates and times correctly, but in the 19th column (column T) the values should all be 0 except the 6th, which is 999,990; pandas, however, reads the column as all 0 with the 9th value being 1.
Does anybody know why it's reading the wrong values?
Thank you!!

import pandas as pd
url = 'https://raw.github.com/liamrisch/helper/master/log_1_2018_09_07.csv'
df = pd.read_csv(url, encoding="ISO-8859-1", delimiter=';')
df.iloc[:,[6,19]]
Gives:
Teil 1-8 - Abstand Rasthaken MP1-MP2 Teil 1-8
0 26,764 0
1 26,787 0
2 26,792 0
3 26,788 0
4 26,771 0
5 999,990 0
6 26,786 0
7 26,785 0
8 26,780 1
9 26,783 0
10 26,798 0

Take a very close look at the data: the value in that cell is actually 1, but since the 999,990 next to it takes up more visual space, it creates the illusion that 999 is the value in that cell.
Printing that column of the df on its own (before any manipulation) shows the actual values of that column, without any surprises.
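If in doubt, you can verify the parsed values directly instead of trusting the visual alignment; a minimal sketch using the same file:
import pandas as pd
url = 'https://raw.github.com/liamrisch/helper/master/log_1_2018_09_07.csv'
df = pd.read_csv(url, encoding="ISO-8859-1", delimiter=';')
# value_counts() shows what is really stored: a single 1 among the 0s.
print(df.iloc[:, 19].value_counts())
# Printing the column on its own removes the misleading alignment with
# the wide 999,990 value in column 6.
print(df.iloc[:, 19].to_string())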

Subtract df1 from df2, df2 from df3 and so on from all data from folder

I have a few data frames as CSV files in the folder.
example1_result.csv
example2_result.csv
example3_result.csv
example4_result.csv
example5_result.csv
Each of my data frames looks like the following:
TestID Result1 Result2 Result3
0 0 5 1
1 1 0 4
2 2 1 2
3 3 0 0
4 4 3 0
5 5 0 1
I want to subtract example1_result.csv from example2_result.csv on the Result1, Result2, and Result3 columns and save the result as a new data frame, result1.csv; then the same subtraction of example2_result.csv from example3_result.csv, and so on.
I want to do this with a Python script. Please help me, as I am a novice in Python. Thanks.
Given CSVs that look like:
TestID,Result1,Result2,Result3
0,0,5,1
1,1,0,4
2,2,1,2
3,3,0,0
4,4,3,0
5,5,0,1
Doing:
import pandas as pd

files = ['example1_result.csv', 'example2_result.csv',
         'example3_result.csv', 'example4_result.csv',
         'example5_result.csv']

dfs = []
for file in files:
    # index_col matters, since we don't want to subtract TestID values
    # from each other
    df = pd.read_csv(file, index_col='TestID')
    dfs.append(df)

num_dfs = len(dfs)
for i, df in enumerate(dfs):
    if i + 1 == num_dfs:
        break
    # each result file holds the later frame minus the earlier one,
    # aligned on TestID
    dfs[i + 1].sub(df).to_csv(f'result{i+1}.csv')
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
dfresult = pd.DataFrame()
dfresult["Result1"] = df2["Result1"] - df1["Result1"] # do for all columns
dfresult.to_csv("result.csv")
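If both files share the same layout, the per-column subtraction can also be written in one step; a minimal sketch, assuming the TestID/Result1-Result3 layout from the question:
import pandas as pd

# Index on TestID so only the Result columns take part in the subtraction.
df1 = pd.read_csv("file1.csv", index_col="TestID")
df2 = pd.read_csv("file2.csv", index_col="TestID")

# Rows are aligned on TestID and all Result columns are subtracted at once.
(df2 - df1).to_csv("result.csv")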

CSV: alternative to excel "IF" statement in python. Read column and create a new one with numpy.where or other function

I have a CSV file with several columns and I want to write a code that will read a specific column called 'ARPU average 6 month w/t roaming and discount' and then, create a new column called "Logical" which will be based on numpy.where(). Here is what I got at the moment:
import numpy as np
import pandas as pd

csv_data = pd.read_csv("Results.csv")
data = csv_data[['ARPU average 6 month w/t roaming and discount']]
data = data.to_numpy()
sol = []
for target in data:
    if1 = np.where(data < 0, 1, 0)
    sol.append(if1)
csv_data["Logical"] = [sol].values
csv_data.to_csv ('Results2.csv', index = False, header=True)
This loop is built incorrectly and does not work: it does not create a new column with the corresponding value for each row. To make it clear: if the value in the column is bigger than 0, the new column should record 1, otherwise 0. The solution can be done in any way (neither np.where() nor a loop is required).
If you want to understand what "Results.csv" is: it is actually a big file of data, and I have highlighted the column we work with. The code needs to check whether the value in that column is bigger than 0 and write back 1 or 0 in the new column (as described above).
updated answer
import pandas as pd
f1 = pd.read_csv("L1.csv")
f2 = pd.read_csv("L2.csv")
f3 = pd.merge(f1, f2, on ='CTN', how ='inner')
# f3.to_csv("Results.csv") # -> you do not need to save the file to a csv unless you really want to
# csv_data = pd.read_csv("Results.csv") # -> f3 is already saved in memory you do not need to read it again
# data = csv_data[['ARPU average 6 month w/t roaming and discount']] # -> you do not need this line
f3['Logical'] = (f3['ARPU average 6 month w/t roaming and discount']>0).astype(int)
f3.to_csv('Results2.csv', index = False, header=True)
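One thing to watch for (not part of the question, just a hedge): if the ARPU column arrives as text, for instance with comma decimal separators, comparing it with 0 will raise a TypeError; a sketch of the conversion, assuming that hypothetical situation:
import pandas as pd

col = 'ARPU average 6 month w/t roaming and discount'

# Hypothetical fix, only needed if the column was parsed as strings such
# as "26,764": normalise the decimal separator and coerce to numbers.
f3[col] = pd.to_numeric(f3[col].astype(str).str.replace(',', '.', regex=False),
                        errors='coerce')
f3['Logical'] = (f3[col] > 0).astype(int)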
original answer
Generally you do not need to use a loop when using pandas or numpy. Take this sample dataframe: df = pd.DataFrame([0,1,2,3,0,0,0,1], columns=['data'])
You can simply use the boolean values returned (where column is greater than 0 return 1 else return 0) to create a new column.
df['new_col'] = (df['data'] > 0).astype(int)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
Or, if you want to use numpy:
df['new_col'] = np.where(df['data']>0, 1, 0)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
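One caveat worth noting for both versions: missing values are silently mapped to 0, because NaN > 0 evaluates to False. A small sketch with a hypothetical frame containing a NaN, in case you need to preserve it:
import numpy as np
import pandas as pd

df = pd.DataFrame([0, 1, np.nan, 3], columns=['data'])

# Both one-liners turn the NaN row into 0, since NaN > 0 is False.
df['new_col'] = (df['data'] > 0).astype(int)

# To keep NaN as NaN instead, mask the missing rows afterwards
# (the column is upcast to float to hold the NaN).
df['new_col_keep_nan'] = np.where(df['data'] > 0, 1, 0)
df.loc[df['data'].isna(), 'new_col_keep_nan'] = np.nan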

Half of dataframe csv file missing column

I have a csv file which is continuously updated with new data. The data used to have 5 columns, but lately it changed to 4; the column missing from the later data is the first one. When I read this csv file into a dataframe, half of the data ends up under the wrong columns. The data is around 50k entries.
df
################################
0   time   C-3   C-4   C-5
__________________________
0   1      str   4     5      <- old entries
0   1      str   4     5
1   str    4     5     NaN    <- new entries
1   str    4     5     NaN
1   str    4     5     NaN
#################################
The first column in earlier entries (where value = 0) are not important.
My expected output is the DataFrame printed with the right values in the right columns. I have no idea how to add the 0 before the str where it is missing, or, the reverse, how to remove the 0 from the old entries (the first column looks to be an index counter, with values starting at 1, then 2, etc.).
Here is how i load and process the csv file at the moment:
import pandas as pd

def process_csv(obj):
    data = 'some_data'
    path = 'data/'
    file = f'{obj}_{data}.csv'
    df = pd.read_csv(path + file)
    df = df.rename(columns=df.iloc[0]).drop(df.index[0])
    df = df[df.time != 'time']
    mask = df['time'].str.contains(' ')
    df['time'] = (pd.to_datetime(df.loc[mask, 'time'])
                    .reindex(df.index)
                    .fillna(pd.to_datetime(df.loc[~mask, 'time'], unit='ms')))
    df = df.drop_duplicates(subset=['time'])
    df = df.set_index("time")
    df = df.sort_index()
    return df
Since the time column should be of type int, a missing first column causes str values that belong in C-3 to end up in the time column, which causes the error:
ValueError: non convertible value str with the unit 'ms'
Question: How can I either remove the early values from column 0 or add some values to the later entries?
If one CSV file contains multiple formats, and these CSV files cannot be changed, then you could parse the files before creating data frames.
For example, the test data below has both 3- and 4-field records. The function gen_read_lines() always yields 3 fields per record.
from io import StringIO
import csv
import pandas as pd
data = '''1,10,11
2,20,21
3,-1,30,31
4,-2,40,41
'''
def gen_read_lines(text):
    # 'text' holds the raw CSV content; csv.reader also accepts an open
    # file handle directly if the data lives on disk.
    lines = csv.reader(StringIO(text))
    for record in lines:
        if len(record) == 3:    # return all fields
            yield record
        elif len(record) == 4:  # drop field 1
            yield [record[i] for i in [0, 2, 3]]
        else:
            raise ValueError(f'did not expect {len(record)} fields')
records = (record for record in gen_read_lines(data))
df = pd.DataFrame(records, columns=['A', 'B', 'C'])
print(df)
A B C
0 1 10 11
1 2 20 21
2 3 30 31
3 4 40 41
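To run the same normalisation over a real file rather than an in-memory string, the generator can read from an open handle; a minimal sketch, with mixed.csv standing in as a hypothetical file name:
import csv
import pandas as pd

def gen_file_lines(path):
    # Same normalisation as above, reading straight from disk.
    with open(path, newline='') as fh:
        for record in csv.reader(fh):
            if len(record) == 3:
                yield record
            elif len(record) == 4:
                yield [record[i] for i in [0, 2, 3]]
            else:
                raise ValueError(f'did not expect {len(record)} fields')

df = pd.DataFrame(gen_file_lines('mixed.csv'), columns=['A', 'B', 'C'])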

How to skip the first two lines when reading in multiple files into a Pandas df [duplicate]

I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how not to import it because the arguments used with the command seem ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the
start of the file."
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?
You can try yourself:
>>> import pandas as pd
>>> from io import StringIO  # on Python 2 this was: from StringIO import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6
I don't have the reputation to comment yet, but I want to add to alko's answer for further reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows
I ran into the same issue when using skiprows while reading a csv file. I was doing skip_rows=1, which does not work: the parameter name is skiprows.
A simple example gives an idea of how to use skiprows while reading a csv file.
import pandas as pd

# skiprows=1 will skip the first line and start reading from the second line
df = pd.read_csv('my_csv_file.csv', skiprows=1)

# print the data frame
df
All of these answers miss one important point: the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that holds the labels, next comes a line that describes the data types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows look like. Example:
# ----------------------------- WARNING ----------------------------------
# Some of the data that you have obtained from this U.S. Geological Survey database
# may not have received Director's approval. ...
agency_cd  site_no   datetime          tz_cd  139719_00065  139719_00065_cd
5s         15s       20d               6s     14n           10s
USGS       08041780  2018-05-06 00:00  CDT    1.98          A
It would be nice if there were a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)  # drop the unit-code row that follows the header
Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). The column-name line (index 0) is skipped as well, so the first remaining line is used for column names. To set column names explicitly, pass the names=['col1', 'col2'] parameter:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d
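The callable form also covers cases a fixed list cannot express, such as randomly subsampling a large file; a minimal sketch, reusing the csv string defined above:
import random
from io import StringIO
import pandas as pd

random.seed(0)  # for reproducibility

# Keep the header (line 0) and skip each data line with probability 0.5.
df = pd.read_csv(StringIO(csv),
                 skiprows=lambda i: i > 0 and random.random() < 0.5)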
skiprows=[1] will skip the second line, not the first one.
