Hi, I'm trying to detect an error in a CSV file.
The file should look as follows
goodfile.csv
"COL_A","COL_B","COL_C","COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
But the file I have is actually
brokenfile.csv
"COL_A","COL_B",COL C,"COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
When I import the two files with pandas
data = pd.read_csv('goodfile.csv')
data = pd.read_csv('brokenfile.csv')
I get the same result
data
COL_A COL_B COL_C COL_D
0 ROW1COLA ROW1COLB ROW1COLC ROW1COLD
1 ROW2COLA ROW2COLB ROW2COLC ROW2COLD
2 ROW3COLA ROW3COLB ROW3COLC ROW3COLD
3 ROW4COLA ROW4COLB ROW4COLC ROW4COLD
4 ROW5COLA ROW5COLB ROW5COLC ROW5COLD
5 ROW6COLA ROW6COLB ROW6COLC ROW6COLD
6 ROW7COLA ROW7COLB ROW7COLC ROW7COLD
Anyway, what I want is to detect the error in the second file, "brokenfile.csv", whose header field COL C is missing its surrounding quotes.
You can detect missing " characters in the columns of a DataFrame with str.contains, then select the offending columns by inverting the resulting boolean array with ~:
import pandas as pd
import io

temp = '''"COL_A","COL_B",COL C,"COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
# after testing, replace io.StringIO(temp) with the filename
# quoting=3 is csv.QUOTE_NONE, so the quote characters are kept in the data
df = pd.read_csv(io.StringIO(temp), quoting=3)
print(df)
"COL_A" "COL_B" COL C "COL_D"
0 "ROW1COLA" "ROW1COLB" "ROW1COLC" "ROW1COLD"
1 "ROW2COLA" "ROW2COLB" "ROW2COLC" "ROW2COLD"
2 "ROW3COLA" "ROW3COLB" "ROW3COLC" "ROW3COLD"
3 "ROW4COLA" "ROW4COLB" "ROW4COLC" "ROW4COLD"
4 "ROW5COLA" "ROW5COLB" "ROW5COLC" "ROW5COLD"
5 "ROW6COLA" "ROW6COLB" "ROW6COLC" "ROW6COLD"
6 "ROW7COLA" "ROW7COLB" "ROW7COLC" "ROW7COLD"
print(df.columns)
Index(['"COL_A"', '"COL_B"', 'COL C', '"COL_D"'], dtype='object')
print(df.columns.str.contains('"'))
[ True True False True]
print(~df.columns.str.contains('"'))
[False False True False]
print(df.columns[~df.columns.str.contains('"')])
Index(['COL C'], dtype='object')
Pandas tries to be smart about recognising data types when reading in data, and that is exactly what is happening in the case you're describing: COL_C and "COL_C" are both parsed as plain strings.
In short, there is no error to detect; at least pandas will not raise one in cases like this.
If you want to detect missing quotes in the header, you could read the first line in a more "traditional" Pythonic way and draw your own conclusions from there:
>>> with open('filename') as f:
...     lines = f.readlines()
....
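For example, a minimal sketch of such a check, assuming the broken file from above and using the csv module with quoting=csv.QUOTE_NONE so the raw quote characters survive:

import csv

with open('brokenfile.csv', newline='') as f:
    header = next(csv.reader(f, quoting=csv.QUOTE_NONE))

# flag header fields that are not wrapped in double quotes
unquoted = [h for h in header if not (h.startswith('"') and h.endswith('"'))]
print(unquoted)  # ['COL C']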
I have found numerous similar questions on Stack Overflow; however, one issue remains unsolved for me. I have a heavily nested .json file that I need to import and convert into an R or Python data.frame to work with. The JSON file contains lists (usually empty, but sometimes holding data); see the minimal reproducible example below for the JSON's structure.
I use R's library jsonlite and Python's pandas.
# R
jsonlite::fromJSON(json_file, flatten = TRUE)
# or
jsonlite::read_json(json_file, simplifyVector = T)
# Python
import json
import pandas as pd

with open("json_file.json", encoding="utf-8") as f:
    data = json.load(f)
pd.json_normalize(data)
Generally, in both cases it works. The output looks like a normal data.frame; however, the problem is that some columns of the new data.frame contain embedded lists. It seems that both pandas and jsonlite combined each list into a single column, which was clearly visible in the R and Python screenshots (omitted here).
As you can see, a column such as wymagania.wymaganiaKonieczne.wyksztalcenia is nothing but a vector containing a combined/embedded list, i.e. the content of a list has been combined into a single column.
As the desired output, I want to split each element of such lists into its own column of the data.frame. In other words, I want to obtain a normal, "tidy" data.frame without any nested data.frames or lists. Both R and Python solutions are appreciated.
Minimum reproducible example:
[
{
"warunkiPracyIPlacy":{"miejscePracy":"abc","rodzajObowiazkow":"abc","zakresObowiazkow":"abc","rodzajZatrudnienia":"abc","kodRodzajuZatrudnienia":"abc","zmianowosc":"abc"},
"wymagania":{
"wymaganiaKonieczne":{
"zawody":[],
"wyksztalcenia":["abc"],
"wyksztalceniaSzczegoly":[{"kodPoziomuWyksztalcenia":"RPs002|WY","kodTypuWyksztalcenia":"abc"}],
"jezyki":[],
"jezykiSzczegoly":[],
"uprawnienia":[]},
"wymaganiaPozadane":{
"zawody":[],
"zawodySzczegoly":[],
"staze":[]},
"wymaganiaDodatkowe":{"zawody":[],"zawodySzczegoly":[]},
"inneWymagania":"abc"
},
"danePracodawcy":{"pracodawca":"abc","nip":"abc","regon":"abc","branza":null},
"pozostaleDane":{"identyfikatorOferty":"abc","ofertaZgloszonaPrzez":"abc","ofertaZgloszonaPrzezKodJednostki":"abc"},
"typOferty":"abc",
"typOfertyNaglowek":"abc",
"rodzajOferty":["DLA_ZAREJESTROWANYCH"],"staz":false,"link":false}
]
This is an answer for Python. It is not very elegant, but I think it will do for your purpose.
I have called your example file nested_json.json
import json
import pandas as pd
json_file = "nested_json.json"
with open(json_file, encoding="utf-8") as f:
data = json.load(f)
df = pd.json_normalize(data)
# explode list-valued cells into one row per element
df_exploded = df.apply(lambda x: x.explode()).reset_index(drop=True)
# check, based on the first row, which columns hold dicts
columns_dict = df_exploded.columns[df_exploded.apply(lambda x: isinstance(x.iloc[0], dict))]
# append the split-out dict keys as new columns
for col in columns_dict:
    df_split_dict = df_exploded[col].apply(pd.Series)
    df_exploded = pd.concat([df_exploded, df_split_dict], axis=1)
This leads to a rectangular dataframe
>>> df_exploded.T
0
typOferty abc
typOfertyNaglowek abc
rodzajOferty DLA_ZAREJESTROWANYCH
staz False
link False
warunkiPracyIPlacy.miejscePracy abc
warunkiPracyIPlacy.rodzajObowiazkow abc
warunkiPracyIPlacy.zakresObowiazkow abc
warunkiPracyIPlacy.rodzajZatrudnienia abc
warunkiPracyIPlacy.kodRodzajuZatrudnienia abc
warunkiPracyIPlacy.zmianowosc abc
wymagania.wymaganiaKonieczne.zawody NaN
wymagania.wymaganiaKonieczne.wyksztalcenia abc
wymagania.wymaganiaKonieczne.wyksztalceniaSzcze... {'kodPoziomuWyksztalcenia': 'RPs002|WY', 'kodT...
wymagania.wymaganiaKonieczne.jezyki NaN
wymagania.wymaganiaKonieczne.jezykiSzczegoly NaN
wymagania.wymaganiaKonieczne.uprawnienia NaN
wymagania.wymaganiaPozadane.zawody NaN
wymagania.wymaganiaPozadane.zawodySzczegoly NaN
wymagania.wymaganiaPozadane.staze NaN
wymagania.wymaganiaDodatkowe.zawody NaN
wymagania.wymaganiaDodatkowe.zawodySzczegoly NaN
wymagania.inneWymagania abc
danePracodawcy.pracodawca abc
danePracodawcy.nip abc
danePracodawcy.regon abc
danePracodawcy.branza None
pozostaleDane.identyfikatorOferty abc
pozostaleDane.ofertaZgloszonaPrzez abc
pozostaleDane.ofertaZgloszonaPrzezKodJednostki abc
kodPoziomuWyksztalcenia RPs002|WY
kodTypuWyksztalcenia abc
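If the original dict-valued columns should not remain once their keys have been split out, an optional follow-up (a sketch; df_exploded and columns_dict come from the code above):

# drop the original dict-valued columns, keeping only the split-out keys
df_flat = df_exploded.drop(columns=columns_dict)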
EDIT: I have stripped down the file to the bits that are problematic
raw_data = {"link":
['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}
df = pd.DataFrame(raw_data, columns = ["link"])
#duplicate check #1
a = print(df.iloc[12][0])
b = print(df.iloc[13][0])
if a == b:
print("equal")
#duplicate check #2
df.duplicated()
For the first test I get the following output, implying there is a duplicate:
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal
For the second test it seems there are no duplicates
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
dtype: bool
Original post:
Trying to identify duplicate values from the "Link" column of the attached data file:
import pandas as pd
data = pd.read_csv(r"...\consolidated.csv", sep=",")
df = pd.DataFrame(data)
del df['Unnamed: 0']
duplicate_rows = df[df.duplicated(["Link"], keep="first")]
pd.DataFrame(duplicate_rows)
#a = print(df.iloc[42657][15])
#b = print(df.iloc[42676][15])
#if a == b:
# print("equal")
Used the code above, but the answer I keep getting is that there are no duplicates. I checked it in Excel and there should be seven duplicate instances. I even selected specific cells for a quick check (the part marked with #s), and the values were identified as equal. Yet duplicated does not capture them.
I have been scratching my head for a good hour, and still no idea what I'm missing - help appreciated!
I had the same problem, and converting the columns of the dataframe to str helped, e.g.:
df['link'] = df['link'].astype(str)
duplicate_rows = df[df.duplicated(["link"], keep="first")]
First, you don't need df = pd.DataFrame(data), as pd.read_csv(r"...\consolidated.csv", sep=",") already returns a DataFrame.
As for the deletion of duplicates, check the drop_duplicates method in the documentation.
Hope this helps.
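One more note on the stripped-down example: the first check prints "equal" only because print() returns None, so a and b are both None, and None == None is always True. The two links at rows 12 and 13 actually differ in letter case (ID43q4X vs ID43q4x), which is why duplicated() correctly reports no duplicates. A sketch of the corrected check, plus a case-insensitive variant in case such links should count as duplicates:

a = df.iloc[12, 0]
b = df.iloc[13, 0]
print(a == b)  # False: the URLs differ only in letter case

# case-insensitive duplicate detection
print(df["link"].str.lower().duplicated().any())  # True
# keep only the first occurrence of each case-insensitive link
df_unique = df[~df["link"].str.lower().duplicated(keep="first")]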
I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used the .str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, because they work on the same column About; and because the values are converted to lowercase first, the regex can be changed to replace everything that is not a lowercase letter or a space:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace(r'[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace(r'[^a-z ]', '', regex=True)
print (df)
About
0 aasd
1 sdd aa
import pandas as pd

columns = ['About']
data = ["ALPHA", "OMEGA", "ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace(r'[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex with flags, for example:
>>> import re
>>> regex_pat = re.compile(r'[^a-z ]+', flags=re.I)
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z ] matches a single character not present in the list: a character that is not in the range a-z (lowercase letters) and not a space.
+ Quantifier - matches between one and unlimited times, as many times as possible, giving back as needed (greedy).
re.I makes the match case-insensitive (redundant here, since the values are lowercased first).
I want to read in a file that looks like this:
1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
4.99993537E+01 9.99987075E+01 .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09
So, each column is denoted by a fixed 15-character sequence, but there's no explicit separator. Does pandas have a way of reading this?
Yes! It's called pd.read_fwf:
from io import StringIO
import pandas as pd
txt = """ 1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
4.99993537E+01 9.99987075E+01 .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09"""
pd.read_fwf(StringIO(txt), widths=[15] * 5, header=None)
0 1 2 3 4
0 1.499981e-01 2.499968e-01 3.999948e-01 5.999922e-01 9.999871e-01
1 1.499981e+00 2.499968e+00 5.999922e+00 9.999871e+00 1.999974e+01
2 4.999935e+01 9.999871e+01 0.000000e+00 -2.706363e+03 -6.370275e+03
3 -1.975213e+04 -4.649283e+04 -1.094354e+05 -3.393231e+05 -7.987023e+05
4 -1.879993e+06 -5.829214e+06 -1.372079e+07 -2.263858e+07 -4.254295e+07
5 -7.601675e+07 -1.254220e+08 -2.356903e+08 -3.888620e+08 -7.307020e+08
6 -1.305466e+09 -2.153480e+09 -4.044550e+09 -4.548962e+09 -5.325339e+09
Let's look at using pd.read_fwf:
df = pd.read_fwf(csv_file,widths=[15]*5,header=None)
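If the fields were not all the same width, read_fwf also accepts explicit (start, end) column boundaries through its colspecs parameter; a short sketch, with 'data.txt' standing in for the actual file:

import pandas as pd

# the five 15-character fields expressed as explicit (start, end) offsets
colspecs = [(i * 15, (i + 1) * 15) for i in range(5)]
df = pd.read_fwf('data.txt', colspecs=colspecs, header=None)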
Another option for whitespace-separated data, for example housing.data:
dataset = pd.read_csv('c:/1/housing.data', engine='python', sep=r'\s+', header=None)
Using Canopy and pandas, I have a data frame a which is defined by:
a=pd.read_csv('text.txt')
df=pd.DataFrame(a)
df.columns=["test"]
text.txt is a single-column file that contains a list of strings with text, numbers, and punctuation.
Assuming df looks like:
test
%hgh&12
abc123!!!
porkyfries
I want my results to be:
test
hgh12
abc123
porkyfries
Effort so far:
from string import punctuation  # import punctuation list from Python itself
a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]  # define the dataframe
for p in list(punctuation):
    df2 = df.test.str.replace(p, '')
    df2 = pd.DataFrame(df2)
df2
The commands above basically just return the same data set. Appreciate any leads.
Edit: The reason I am using pandas is that the data is huge, spanning about 1M rows, and the code will later be applied to lists that go up to 30M rows.
Long story short, I need to clean data in a very efficient manner for big data sets.
Using replace with the correct regex would be easier:
In [41]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
text
0 test
1 %hgh&12
2 abc123!!!
3 porkyfries
[4 rows x 1 columns]
Use regex with the pattern [^\w\s], which means "not alphanumeric/whitespace":
In [49]:
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
text
0 test
1 hgh12
2 abc123
3 porkyfries
[4 rows x 1 columns]
For removing punctuation from a text column in your dataframe:
In:
import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)
pattern
Out:
'[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~]'
In:
df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df
Out:
text
0 book...regh
1 book...
2 boo,
3 book.
4 ball,
5 ballnroll"
6 "rope"
7 rick %
In:
df['text'] = df['text'].str.replace(pattern, '', regex=True)
df
You can replace the pattern with your desired character, e.g. replace(pattern, '$', regex=True).
Out:
text
0 bookregh
1 book
2 boo
3 book
4 ball
5 ballnroll
6 rope
7 rick
translate is often considered the cleanest and fastest way to remove punctuation. In Python 3, str.translate takes a table built with str.maketrans; here everything in string.punctuation except the double quote is removed:
import string
text = text.translate(str.maketrans('', '', string.punctuation.replace('"', '')))
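The same table also works column-wise in pandas via Series.str.translate; a small sketch, reusing the text column from the examples above:

import string

# build a translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)
df['text'] = df['text'].str.translate(table)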
You may find that it works better to remove punctuation in 'a' before loading it into pandas.