reading file with missing values in python pandas

I'm trying to read a .txt file with missing values using pandas.read_csv. My data is of the format:
10/08/2012,12:10:10,name1,0.81,4.02,50;18.5701400N,4;07.7693770E,7.92,10.50,0.0106,4.30,0.0301
10/08/2012,12:10:11,name2,,,,,10.87,1.40,0.0099,9.70,0.0686
with thousands of samples sharing the same point name, GPS position, and other readings.
I use this code:
myData = read_csv('~/data.txt', sep=',', na_values='')
This doesn't work: na_values does not give NaN or any other missing-value indicator. The columns should all have the same length, but I end up with columns of different lengths.
I don't know exactly what should be passed to na_values (I tried many different things).
Thanks

The parameter na_values must be "list like" (see this answer).
A string is "list like" so:
na_values='abc' # would transform the letters 'a', 'b' and 'c' each into `nan`
# is equivalent to
na_values=['a','b','c']
Similarly:
na_values=''
# is equivalent to
na_values=[] # and this is not what you want!
This means that you need to use na_values=[''].
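Applied to the data above, a minimal sketch (reusing the asker's path; header=None is an assumption, since the sample rows carry no header line):
import pandas as pd

# na_values=[''] marks empty fields as NaN; header=None because the
# file appears to have no header row
myData = pd.read_csv('~/data.txt', sep=',', header=None, na_values=[''])
print(myData.isna().sum())  # NaN count per column, to verify the parse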

What version of pandas are you on? Interpreting the empty string as NaN is the default behavior for pandas, and it seems to parse the empty strings in your data snippet fine in both v0.7.3 and current master, without using the na_values parameter at all.
In [10]: data = """\
10/08/2012,12:10:10,name1,0.81,4.02,50;18.5701400N,4;07.7693770E,7.92,10.50,0.0106,4.30,0.0301
10/08/2012,12:10:11,name2,,,,,10.87,1.40,0.0099,9.70,0.0686
"""
In [11]: read_csv(StringIO(data), header=None).T
Out[11]:
                   0           1
X.1       10/08/2012  10/08/2012
X.2         12:10:10    12:10:11
X.3            name1       name2
X.4             0.81         NaN
X.5             4.02         NaN
X.6   50;18.5701400N         NaN
X.7    4;07.7693770E         NaN
X.8             7.92       10.87
X.9             10.5         1.4
X.10          0.0106      0.0099
X.11             4.3         9.7
X.12          0.0301      0.0686
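For reference, a sketch of the same check on a current pandas, where only the StringIO import needs spelling out; empty fields still come back as NaN without any na_values argument:
from io import StringIO
import pandas as pd

data = """\
10/08/2012,12:10:10,name1,0.81,4.02,50;18.5701400N,4;07.7693770E,7.92,10.50,0.0106,4.30,0.0301
10/08/2012,12:10:11,name2,,,,,10.87,1.40,0.0099,9.70,0.0686
"""
df = pd.read_csv(StringIO(data), header=None)
print(df.T)  # transpose for an easy side-by-side comparison of the two rows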

Related

pandas split string with $ special text style

I have an Excel file whose data contains two $ signs per cell; when I read it using pandas, it converts them into a very special text style.
import pandas as pd
df = pd.DataFrame({ 'Bid-Ask':['$185.25 - $186.10','$10.85 - $11.10','$14.70 - $15.10']})
After pd.read_excel, I try:
df['Bid'] = df['Bid-Ask'].str.split('−').str[0]
The above code doesn't work: the $ turns my string into special-style text, and the split function does nothing.
My expected result is the Bid and Ask values in separate columns.
Do not split. Using str.extract is likely the most robust:
df[['Bid', 'Ask']] = df['Bid-Ask'].str.extract(r'(\d+(?:\.\d+)?)\D*(\d+(?:\.\d+)?)')
Output:
             Bid-Ask     Bid     Ask
0  $185.25 - $186.10  185.25  186.10
1    $10.85 - $11.10   10.85   11.10
2    $14.70 - $15.10   14.70   15.10
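Note that str.extract returns strings; if you need numeric columns for arithmetic afterwards, a follow-up cast (a sketch):
# the extracted groups are strings such as '185.25'; make them floats
df[['Bid', 'Ask']] = df[['Bid', 'Ask']].astype(float)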
There is a non-breaking space (\xa0) in your string. That's why the split doesn't work.
I copied the strings (of your df) one by one into an Excel file and then imported it with pd.read_excel.
The column looks like this then:
repr(df['Bid-Ask'])
'0 $185.25\xa0- $186.10\n1 $10.85\xa0- $11.10\n2 $14.70\xa0- $15.10\nName: Bid-Ask, dtype: object'
Before splitting you can replace that and it'll work.
df['Bid-Ask'] = df['Bid-Ask'].astype('str').str.replace('\xa0', ' ', regex=False)
df[['Bid', 'Ask']] = df['Bid-Ask'].str.replace('$', '', regex=False).str.split('-', expand=True)
print(df)
             Bid-Ask     Bid     Ask
0  $185.25 - $186.10  185.25  186.10
1    $10.85 - $11.10   10.85   11.10
2    $14.70 - $15.10   14.70   15.10
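As an alternative sketch, Unicode normalization folds the non-breaking space into an ordinary space, so you don't need to know the exact offending character:
# NFKC normalization turns \xa0 into a plain space before splitting
df['Bid-Ask'] = df['Bid-Ask'].astype('str').str.normalize('NFKC')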
You can use apply with a lambda to split the column values in two and slice off the leading $:
df['Bid'] = df['Bid-Ask'].apply(lambda x: x.split('-')[0].strip()[1:])
df['Ask'] = df['Bid-Ask'].apply(lambda x: x.split('-')[1].strip()[1:])
output:
          Bid-Ask     Bid    Ask
0  185.25− 186.10  185.25  186.1
1   10.85− 11.10    10.85   11.1
2   14.70− 15.10    14.70   15.1

Convert heavily nested json file into R/Python dataframe

I have found numerous similar questions on Stack Overflow; however, one issue remains unsolved for me. I have a heavily nested .json file that I need to import and convert into an R or Python data.frame to work with. The JSON contains lists (usually empty, but sometimes holding data). Example of the JSON structure:
I use R's library jsonlite and Python's pandas.
# R
jsonlite::fromJSON(json_file, flatten = TRUE)
# or
jsonlite::read_json(json_file, simplifyVector = T)
# Python
import json
import pandas as pd

with open("json_file.json", encoding="utf-8") as f:
    data = json.load(f)
pd.json_normalize(data)
Generally, both work. The output looks like a normal data.frame; however, the problem is that some columns of the new data.frame contain embedded lists (I am not sure whether "embedded lists" is the correct, clear term). It seems that both pandas and jsonlite combined each list into a single column, which is clearly seen in the screenshots below.
[screenshots of the R and Python output]
As you can see, a column such as wymagania.wymaganiaKonieczne.wyksztalcenia is nothing but a vector containing a combined/embedded list, i.e. the content of a list has been combined into a single column.
As the desired output I want each element of such lists split out into its own column of the data.frame. In other words, I want to obtain a normal, tidy data.frame without any nested data.frames or lists. Both R and Python answers are appreciated.
Minimum reproducible example:
[
  {
    "warunkiPracyIPlacy": {"miejscePracy": "abc", "rodzajObowiazkow": "abc", "zakresObowiazkow": "abc", "rodzajZatrudnienia": "abc", "kodRodzajuZatrudnienia": "abc", "zmianowosc": "abc"},
    "wymagania": {
      "wymaganiaKonieczne": {
        "zawody": [],
        "wyksztalcenia": ["abc"],
        "wyksztalceniaSzczegoly": [{"kodPoziomuWyksztalcenia": "RPs002|WY", "kodTypuWyksztalcenia": "abc"}],
        "jezyki": [],
        "jezykiSzczegoly": [],
        "uprawnienia": []
      },
      "wymaganiaPozadane": {
        "zawody": [],
        "zawodySzczegoly": [],
        "staze": []
      },
      "wymaganiaDodatkowe": {"zawody": [], "zawodySzczegoly": []},
      "inneWymagania": "abc"
    },
    "danePracodawcy": {"pracodawca": "abc", "nip": "abc", "regon": "abc", "branza": null},
    "pozostaleDane": {"identyfikatorOferty": "abc", "ofertaZgloszonaPrzez": "abc", "ofertaZgloszonaPrzezKodJednostki": "abc"},
    "typOferty": "abc",
    "typOfertyNaglowek": "abc",
    "rodzajOferty": ["DLA_ZAREJESTROWANYCH"],
    "staz": false,
    "link": false
  }
]
This is an answer for Python. It is not very elegant, but I think it will do for your purpose.
I have called your example file nested_json.json.
import json
import pandas as pd

json_file = "nested_json.json"
with open(json_file, encoding="utf-8") as f:
    data = json.load(f)

df = pd.json_normalize(data)
# explode list entries so each element gets its own row
df_exploded = df.apply(lambda x: x.explode()).reset_index(drop=True)
# check, based on the first row, which columns hold dicts
columns_dict = df_exploded.columns[df_exploded.apply(lambda x: isinstance(x.iloc[0], dict))]
# append the split-out dict keys to the dataframe as new columns
for col in columns_dict:
    df_split_dict = df_exploded[col].apply(pd.Series)
    df_exploded = pd.concat([df_exploded, df_split_dict], axis=1)
This leads to a rectangular dataframe
>>> df_exploded.T
0
typOferty abc
typOfertyNaglowek abc
rodzajOferty DLA_ZAREJESTROWANYCH
staz False
link False
warunkiPracyIPlacy.miejscePracy abc
warunkiPracyIPlacy.rodzajObowiazkow abc
warunkiPracyIPlacy.zakresObowiazkow abc
warunkiPracyIPlacy.rodzajZatrudnienia abc
warunkiPracyIPlacy.kodRodzajuZatrudnienia abc
warunkiPracyIPlacy.zmianowosc abc
wymagania.wymaganiaKonieczne.zawody NaN
wymagania.wymaganiaKonieczne.wyksztalcenia abc
wymagania.wymaganiaKonieczne.wyksztalceniaSzcze... {'kodPoziomuWyksztalcenia': 'RPs002|WY', 'kodT...
wymagania.wymaganiaKonieczne.jezyki NaN
wymagania.wymaganiaKonieczne.jezykiSzczegoly NaN
wymagania.wymaganiaKonieczne.uprawnienia NaN
wymagania.wymaganiaPozadane.zawody NaN
wymagania.wymaganiaPozadane.zawodySzczegoly NaN
wymagania.wymaganiaPozadane.staze NaN
wymagania.wymaganiaDodatkowe.zawody NaN
wymagania.wymaganiaDodatkowe.zawodySzczegoly NaN
wymagania.inneWymagania abc
danePracodawcy.pracodawca abc
danePracodawcy.nip abc
danePracodawcy.regon abc
danePracodawcy.branza None
pozostaleDane.identyfikatorOferty abc
pozostaleDane.ofertaZgloszonaPrzez abc
pozostaleDane.ofertaZgloszonaPrzezKodJednostki abc
kodPoziomuWyksztalcenia RPs002|WY
kodTypuWyksztalcenia abc
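If you want the frame fully flat afterwards, the original dict-valued columns can be dropped once their keys have been expanded; a small sketch building on the code above:
# the dict keys now live in their own columns, so the originals can go
df_flat = df_exploded.drop(columns=columns_dict)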

Add a new column to a dataframe based on first value in row

I have a dataframe like such:
>>> import pandas as pd
>>> pd.read_csv('csv/10_no_headers_with_com.csv')
                 //field  field2
0  //first field is time     NaN
1                 132605     1.0
2                 132750     2.0
3                 132772     3.0
4                 132773     4.0
5                 133065     5.0
6                 133150     6.0
I would like to add another field that says whether the first value of the first field is a comment character, //. So far I have something like this:
# may not have a heading value, so use the index not the key
df[0].str.startswith('//')
What would be the correct way to add on a new column with this value, so that the result is something like:
>>> pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
                       0       1  _starts_with_comment
0                //field  field2                  True
1  //first field is time     NaN                  True
2                 132605       1                 False
3                 132750       2                 False
4                 132772       3                 False
What is the issue with your command, simply assigned to a new column?
df['comment_flag'] = df[0].str.startswith('//')
Or do you indeed have mixed type columns as mentioned by jpp?
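If the column really is mixed-type (parsed numbers alongside strings), .str.startswith returns NaN for the non-string entries; casting first sidesteps that, as a sketch:
# cast to str so startswith also works on rows parsed as numbers
df['comment_flag'] = df[0].astype(str).str.startswith('//')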
EDIT:
I'm not quite sure, but from your comments I get the impression that you don't really need an additional column of comment flags. In case you want to load the data without comments into a dataframe, but still use the field names hidden in the commented header as column names, you might want to check this out:
So based on this textfile:
//field field2
//first field is time NaN
132605 1.0
132750 2.0
132772 3.0
132773 4.0
133065 5.0
133150 6.0
You could do:
cmt = '//'
header = []
with open(textfilename, 'r') as f:
    for line in f:
        if line.startswith(cmt):
            header.append(line)
        else:  # leave that out if collecting all comments of the entire file is OK/wanted
            break
print(header)
# ['//field field2\n', '//first field is time NaN\n']
This way you have the header information prepared for being used for e.g. column names.
Getting the names from the first header line and using them for the pandas import looks like this:
nms = header[0][2:].split()
df = pd.read_csv(textfilename, comment=cmt, names=nms, sep=r'\s+', engine='python')
    field  field2
0  132605     1.0
1  132750     2.0
2  132772     3.0
3  132773     4.0
4  133065     5.0
5  133150     6.0
One way is to utilise pd.to_numeric, assuming non-numeric data in the first column must indicate a comment:
df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()
Just note this kind of mixing types within series is strongly discouraged. Your first two series will no longer support vectorised operations as they will be stored in object dtype series. You lose some of the main benefits of Pandas.
A much better idea is to use the csv module to extract those attributes at the top of your file and store them as separate variables. A sketch of how you can achieve this follows.
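A minimal sketch, assuming the whitespace-separated file shown in the question:
import csv

comments = []
with open('csv/10_no_headers_with_com.csv', newline='') as f:
    reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
    for row in reader:
        if row and row[0].startswith('//'):
            comments.append(row)  # e.g. ['//field', 'field2']
        else:
            break  # first non-comment row reached; the data starts here

print(comments)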
Try this:
import pandas as pd
import numpy as np
df.loc[:,'_starts_with_comment'] = np.where(df[0].str.startswith(r'//'), True, False)

pd.to_csv saves, but apparently the wrong data (according to print function)

This is my first post on Stack Overflow, so please go easy on me if I don't follow the common style guide correctly.
I am doing the Kaggle challenge "predict house_prices". My first step is to preprocess the dataset. There are empty cells in the data, shown as NaN. With df["Headline"].fillna("NA") I change them to "NA", which, in this challenge, is defined as "not further described".
The print function shows that the approach works. At the end, I want to save my modified DataFrame into a .csv file (you can see the path and filename in the code).
However, while the .csv does indeed save, the data in it is apparently wrong. So I guess I must have made a mistake with the syntax of pd.to_csv.
First, here's my code. Afterwards, you'll find what the console says about the modified dataframe maindf and the dataframe of my .csv file csvdf. Sorry for the poor console formatting, by the way.
import os
import pandas as pd
import numpy as np
#Variables
PRICE = []
CRIT = []
#Directories
DATADIR = r"C:\Users\Hp\Desktop\Project_Arcus\house_price\data"
DATA = "train.csv"
path = os.path.join(DATADIR, DATA)
MODFILE = "train_modified.csv"
mod_path = os.path.join(DATADIR, MODFILE)
print(f"Training Data is {path}")
print(f"Modified Training Data is{mod_path}")
# Goal: Open the document of the chosen path. Extract data (f. e. the headline)
df = pd.read_csv(path)
maindf = df # this step is unnecessary, but it helped me to better understand.
# Goal: Check for empty cells. Replace them with a fitting value, so the neural network can
# threat them accordingly. Save the .csv under a new name.
maindf["PoolQC"] = df["PoolQC"].fillna("NA")
maindf["MiscFeature"] = df["MiscFeature"].fillna("NA")
maindf["Alley"] = df["Alley"].fillna("NA")
maindf["Fence"] = df["Fence"].fillna("NA")
maindf["FireplaceQu"] = df["FireplaceQu"].fillna("NA")
maindf.to_csv(mod_path,index=True) # index=False means there will be no row names (index).
# Next Goal: Save the dataframe df into a csv document "train_modified.csv" WORKS
# Check if the new file is correct. Not correct! NaN included...!
#print(df.isnull().sum())
csvdf = pd.read_csv(mod_path)
#print(csvdf.isnull().sum())
print(maindf["PoolQC"].head(10))
print(csvdf["PoolQC"].head(10))
Training Data is C:\Users\Hp\Desktop\Project_Arcus\house_price\data\train.csv
Modified Training Data is C:\Users\Hp\Desktop\Project_Arcus\house_price\data\train_modified.csv
0    NA
1    NA
2    NA
3    NA
4    NA
5    NA
6    NA
7    NA
8    NA
9    NA
Name: PoolQC, dtype: object
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
Name: PoolQC, dtype: object
The issue isn't with to_csv, it's with read_csv, the documentation for which states:
na_values : scalar, str, list-like, or dict, default None
By default the following values are interpreted as NaN: ‘’, ‘#N/A’,
‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’,
‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
Instead, define keep_default_na and na_values arguments when you use read_csv:
csvdf = pd.read_csv(mod_path, keep_default_na=False, na_values='')
You may wish to supply a list of values for na_values: if used with keep_default_na=False, Pandas will consider only those values as NaN.
A better idea is to use a less ambiguous string than 'NA' to represent data you don't want to be read as NaN.
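Following that last suggestion, a sketch using the asker's variables and a hypothetical sentinel that is not on pandas' default NA list:
# 'NotDescribed' is a made-up sentinel; unlike 'NA' it is not on the
# default NA list, so it survives the to_csv/read_csv round trip
maindf["PoolQC"] = df["PoolQC"].fillna("NotDescribed")
maindf.to_csv(mod_path, index=False)
csvdf = pd.read_csv(mod_path)  # no special arguments needed
print(csvdf["PoolQC"].head(10))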

First time organizing columns in text file

I am a first-time Python user. I have a text file of star data that I need to sort by columns, and then I just need the data from the V band. I have no idea how to start. Can someone please help, even if it's just to get me started?
If you can install pandas, then sorting on any column can be done like this:
#!/usr/bin/python
# read_stars.py
import sys
import pandas as pd

filename = sys.argv[1]  # or 'star_data.txt'
sep = '\t'  # or ',' or ' ', etc.
df = pd.read_csv(filename, sep=sep)
print(df.sort_values(['Band']))
Change the commented lines to better suit your needs. From your comment it seems the separator may be tabs (so first try '\t' and change it until parsing is successful). sys.argv[1] uses the file passed as a command-line argument, like so:
$ python read_stars.py star_data.txt
              JD  Magnitude  Uncertainty  HQuncertainty Band Observer Code  \
28  2.456420e+06     16.400        0.073            NaN    V           PSD
29  2.456421e+06      16.09        0.090            NaN    V           DKS
... (etc) ...
42           STD        NaN          NaN            NaN
0            STD        NaN          NaN            NaN
[58 rows x 23 columns]
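Since the end goal is just the V-band data, a follow-up sketch (assuming the column is named Band, as in the output above):
# boolean mask keeps only the rows observed in the V band
v_band = df[df['Band'] == 'V']
print(v_band)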
Hope this helps!
