I have a CSV file which is continuously updated with new data. The data originally had 5 columns, but lately it has changed to 4 columns: the column missing from the newer data is the first one. When I try to read this CSV file into a DataFrame, half of the data ends up under the wrong columns. The data is around 50k entries.
df
################################
  0  time  C-3  C-4  C-5
  ------------------------
  0  1     str  4    5    <- old entries
  0  1     str  4    5
  1  str   4    5    NaN  <- new entries
  1  str   4    5    NaN
  1  str   4    5    NaN
################################
The first column in the earlier entries (where the value is 0) is not important.
My expected output is the DataFrame printed with the right values in the right columns. I have no idea how to add the 0 before the str where the 0 is missing, or the reverse, how to remove the 0 (since the 0 looks to be an index counter, with values starting at 1, then 2, etc.).
Here is how I currently load and process the CSV file:
def process_csv(obj):
    data = 'some_data'
    path = 'data/'
    file = f'{obj}_{data}.csv'
    df = pd.read_csv(path + file)
    df = df.rename(columns=df.iloc[0]).drop(df.index[0])
    df = df[df.time != 'time']
    mask = df['time'].str.contains(' ')
    df['time'] = (pd.to_datetime(df.loc[mask, 'time'])
                    .reindex(df.index)
                    .fillna(pd.to_datetime(df.loc[~mask, 'time'], unit='ms')))
    df = df.drop_duplicates(subset=['time'])
    df = df.set_index("time")
    df = df.sort_index()
    return df
Since the time column should be numeric, a missing first column causes str values that should end up in C-3 to land in the time column instead, which causes this error:
ValueError: non convertible value str with the unit 'ms'
Question: How can I either remove the early values from column 0 or add some values to the later entries?
If one CSV file contains multiple formats, and these CSV files cannot be changed, then you could parse the files before creating data frames.
For example, the test data below has both 3- and 4-field records. The function gen_read_lines() always yields 3 fields per record.
from io import StringIO
import csv
import pandas as pd
data = '''1,10,11
2,20,21
3,-1,30,31
4,-2,40,41
'''
def gen_read_lines(filename):
    lines = csv.reader(StringIO(filename))
    for record in lines:
        if len(record) == 3:    # return all fields
            yield record
        elif len(record) == 4:  # drop field 1
            yield [record[i] for i in [0, 2, 3]]
        else:
            raise ValueError(f'did not expect {len(record)} fields')

records = (record for record in gen_read_lines(data))
df = pd.DataFrame(records, columns=['A', 'B', 'C'])
print(df)
A B C
0 1 10 11
1 2 20 21
2 3 30 31
3 4 40 41
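If the mixed-format rows have already been read into one DataFrame (or you prefer to repair them after reading), another option is to detect the new-format rows by the NaN in the last column and shift them one position to the right. A minimal sketch with hypothetical column names matching the layout above; everything is read as strings so the shift is dtype-safe, and types can be converted afterwards:

```python
from io import StringIO

import pandas as pd

# hypothetical sample mirroring the mixed 5-/4-field layout above
raw = """0,1,str,4,5
0,1,str,4,5
1,str,4,5
1,str,4,5
"""

# fixed 5-column header: 4-field rows leave the last column as NaN
cols = ['c0', 'time', 'C-3', 'C-4', 'C-5']
df = pd.read_csv(StringIO(raw), names=cols, dtype=str)

# new-format rows are exactly the ones missing the last field
new = df['C-5'].isna()
df.loc[new, ['time', 'C-3', 'C-4', 'C-5']] = df.loc[new, ['c0', 'time', 'C-3', 'C-4']].values

df = df.drop(columns='c0')  # the leading counter column is not needed
print(df)
```

After the shift every row has time, C-3, C-4, and C-5 aligned, so the usual pd.to_datetime / astype conversions can be applied as before.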
I want to print out all row values corresponding to a specific month in an Excel file using Python. Please look at the picture.
If I pass "2021.10" to the function, I want the values ["2021.10.03", "2021.10.03", "2021.10.03", "2021.10.03", "2021.10.03", None, "2021.10.15", "2021.10.15", None] to be entered in the list.
Simply put: if I pass the value "2021.10" into the function, I would like to extract the values from rows 5 to 13 of column A.
What should I do?
I'm reading the sheet with openpyxl at the moment.
import openpyxl as oxl
load_excel = oxl.load_workbook('C:/Users/Homework.xlsx',data_only = True)
load_sheet = load_excel['Sheet']
You can declare a list first and then write a loop that traverses the rows you need, appending each value to the list as you go. After the loop completes you will have your extracted values. For example (note that shadowing the built-in list is best avoided, and openpyxl's iter_rows() takes min_row/max_row/min_col/max_col arguments):
values = []
for (value,) in load_sheet.iter_rows(min_row=5, max_row=13, min_col=1, max_col=1, values_only=True):
    values.append(value)
This can extract the rows you want.
# dataframe from the excel file
A B C
0 2021.09.23 E 1
1 2021.09.23 A 1
2 2021.09.23 E 1
3 None None 3
4 2021.10.03 A 1
5 2021.10.03 A 2
6 2021.10.03 B 2
7 2021.10.03 E 1
8 2021.10.03 A 1
9 None None 7
10 2021.10.15 A 2
11 2021.10.15 B 3
12 None None 5
13 2021.11.03 C 2
14 2021.11.03 B 1
15 2021.11.03 F 2
import pandas as pd
import openpyxl
def extract(df, value):
    df = df.reset_index()  # to make an index column
    first_index = df[df['A'].str.startswith(value)].iloc[0]['index']   # index of first 2021.10 value
    last_index = df[df['A'].str.startswith(value)].iloc[-1]['index']   # index of last 2021.10 value
    sub_df = df.iloc[first_index:last_index + 1, ]  # dataframe from first index to last index
    for i, row in df[last_index + 1:].iterrows():
        # add any None rows right after the last value
        if row['A'] == 'None':
            sub_df = sub_df.append(row)  # note: append was removed in pandas 2.x; use pd.concat there
        else:
            break
    print(sub_df['A'].to_list())
df = pd.read_excel('C:/Users/Homework.xlsx') # I read xlsx file using pandas instead of `openpyxl`
extract(df, '2021.10')
This will be printed by the function:
['2021.10.03', '2021.10.03', '2021.10.03', '2021.10.03', '2021.10.03', 'None', '2021.10.15', '2021.10.15', 'None']
In case you want to read an Excel file using openpyxl, you can change the last lines as follows:
wb = openpyxl.load_workbook('test.xlsx')
sheet = wb.active
df = pd.DataFrame(sheet.values) # convert a data to Pandas dataframe
df.columns = df.iloc[0, :] # to use first row as column names
df = df.iloc[1:, :] # to use the other rows as a data
extract(df, '2021.10')
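For comparison, here is a more compact variant of the same idea with less explicit looping — a sketch that assumes a default integer index and that empty cells come back as None (they become the string 'None' after astype(str), matching the frame above; adjust the check if your data differs):

```python
import pandas as pd

# hypothetical frame mirroring the one shown above (shortened)
df = pd.DataFrame({'A': ['2021.09.23', None, '2021.10.03', '2021.10.03',
                         None, '2021.10.15', None, '2021.11.03']})

def extract(df, value):
    col = df['A'].astype(str)            # None becomes the string 'None'
    m = col.str.startswith(value)        # rows matching the month prefix
    first, last = m[m].index[0], m[m].index[-1]
    # extend past the last match while the following rows are None
    while last + 1 < len(df) and col.iloc[last + 1] == 'None':
        last += 1
    return col.iloc[first:last + 1].tolist()

print(extract(df, '2021.10'))
```

The slice from the first to the last matching row automatically keeps any None rows that sit between matches; the while loop only handles the trailing ones.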
I have a CSV file with several columns, and I want to write code that reads a specific column called 'ARPU average 6 month w/t roaming and discount' and then creates a new column called "Logical" based on numpy.where(). Here is what I have at the moment:
csv_data = pd.read_csv("Results.csv")
data = csv_data[['ARPU average 6 month w/t roaming and discount']]
data = data.to_numpy()
sol = []
for target in data:
    if1 = np.where(data < 0, 1, 0)
    sol.append(if1)
csv_data["Logical"] = [sol].values
csv_data.to_csv ('Results2.csv', index = False, header=True)
This loop is implemented incorrectly and does not work: it does not create a new column with the corresponding value for each row. To make it clear: if the value in the column is bigger than 0, it should record 1, otherwise 0. The solution can be done any way (neither np.where() nor a loop is required).
In case you want to understand what "Results.csv" is: it is a big file with data, and I have highlighted the column we work with. The code needs to check whether there is a value bigger than 0 in the column and give back 1 or 0 in the new column (as I described in the question).
updated answer
import pandas as pd
f1 = pd.read_csv("L1.csv")
f2 = pd.read_csv("L2.csv")
f3 = pd.merge(f1, f2, on ='CTN', how ='inner')
# f3.to_csv("Results.csv") # -> you do not need to save the file to a csv unless you really want to
# csv_data = pd.read_csv("Results.csv") # -> f3 is already saved in memory you do not need to read it again
# data = csv_data[['ARPU average 6 month w/t roaming and discount']] # -> you do not need this line
f3['Logical'] = (f3['ARPU average 6 month w/t roaming and discount']>0).astype(int)
f3.to_csv('Results2.csv', index = False, header=True)
original answer
Generally you do not need to use a loop when using pandas or numpy. Take this sample dataframe: df = pd.DataFrame([0,1,2,3,0,0,0,1], columns=['data'])
You can simply use the boolean values returned (where column is greater than 0 return 1 else return 0) to create a new column.
df['new_col'] = (df['data'] > 0).astype(int)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
Or, if you want to use numpy:
df['new_col'] = np.where(df['data']>0, 1, 0)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how not to import it because the arguments used with the command seem ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the
start of the file.
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?
You can try yourself:
>>> import pandas as pd
>>> from io import StringIO  # on Python 2: from StringIO import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6
I don't have the reputation to comment yet, but I want to add to alko's answer for further reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows
I got the same issue when using skiprows while reading a CSV file.
I was doing skip_rows=1 — note that this will not work; the parameter is named skiprows.
A simple example gives an idea of how to use skiprows while reading a csv file.
import pandas as pd
# skiprows=1 will skip the first line and read from the second line
df = pd.read_csv('my_csv_file.csv', skiprows=1)
# print the data frame
df
All of these answers miss one important point -- the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that is the labels, next comes a line that describes the data types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
> # ----------------------------- WARNING ----------------------------------
> # Some of the data that you have obtained from this U.S. Geological Survey database
> # may not have received Director's approval. ...
> agency_cd  site_no   datetime          tz_cd  139719_00065  139719_00065_cd
> 5s         15s       20d               6s     14n           10s
> USGS       08041780  2018-05-06 00:00  CDT    1.98          A
It would be nice if there was a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)
Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). The column names are skipped as well (index 0), and the first remaining line is used for column names. To add column names, use the names=['col1', 'col2'] parameter:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d
skiprows=[1] will skip the second line, not the first one.
I have a big dataframe of around 30000 rows and a single column containing a JSON string. Each JSON string contains a number of variables and their values, and I want to break this JSON string down into columns of data.
Two rows look like:
0    {"a":"1","b":"2","c":"3"}
1    {"a" :"4","b":"5","c":"6"}
I want to convert this into a dataframe like
a b c
1 2 3
4 5 6
Please help
Your column values seem to have an extra number before the actual JSON string, so you might want to strip that out first (skip to Method below if that isn't the case).
One way to do that is to apply a function to the column.
# constructing the df
df = pd.DataFrame([['0 {"a":"1","b":"2","c":"3"}'],['1 {"a" :"4","b":"5","c":"6"}']], columns=['json'])
# print(df)
json
# 0 0 {"a":"1","b":"2","c":"3"}
# 1 1 {"a" :"4","b":"5","c":"6"}
# function to remove the leading number
import re
def split_num(val):
    p = re.compile("({.*)")
    return p.search(val).group(1)

# applying the function
df['json'] = df['json'].map(split_num)
print(df)
# json
# 0 {"a":"1","b":"2","c":"3"}
# 1 {"a" :"4","b":"5","c":"6"}
Method:
Once the df is in the above format, the below will convert each row entry to a dictionary:
df['json'] = df['json'].map(lambda x: dict(eval(x)))  # note: json.loads(x) is safer than eval for untrusted data
Then, applying pd.Series to the column will do the job
d = df['json'].apply(pd.Series)
print(d)
# a b c
# 0 1 2 3
# 1 4 5 6
Alternatively, if the JSON records live one per line in a file:
import json
import pandas as pd

with open(json_file) as f:
    df = pd.DataFrame(json.loads(line) for line in f)
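If the strings are valid JSON, pandas' own json_normalize can also do the flattening in one step — a sketch on a hypothetical two-row column like the one in the question:

```python
import json

import pandas as pd

df = pd.DataFrame({'json': ['{"a":"1","b":"2","c":"3"}',
                            '{"a":"4","b":"5","c":"6"}']})

# parse each string into a dict, then expand the dicts into columns
out = pd.json_normalize(df['json'].map(json.loads).tolist())
print(out)
```

Unlike eval, json.loads only accepts well-formed JSON, so malformed rows fail loudly instead of executing arbitrary expressions.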