I've got a .txt file that looks like this:
id nm lat lon countryCode
5555555 London 55.876456 99.546231 UK
I need to parse each field and add them to a SQLite database. So far I've managed to transfer into my db the id, name and countryCode columns, but I'm struggling to find a solution to parse the lat and lon of each record individually.
I tried with regex, but no luck. I also thought about writing a parser that checks whether the last non-whitespace char is a letter, to determine that the string is lat and not lon, but I have no idea how to implement it correctly. Can I solve this with regex, or should I use a custom parser? If so, how?
You can do that with pandas like this:
import pandas as pd
import sqlite3
con = sqlite3.connect('path/new.db')
con.text_factory = str
df = pd.read_csv('file_path', sep='\t')
df.to_sql('table_01', con)
If there are bad lines and you can afford to skip them, then use this:
df = pd.read_csv('file_path', sep='\t', error_bad_lines=False)
Read more in the pandas read_csv documentation.
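Note: in pandas 1.3 and later, error_bad_lines is deprecated (and removed in 2.0) in favor of on_bad_lines, so on a recent pandas the equivalent call would be:
df = pd.read_csv('file_path', sep='\t', on_bad_lines='skip')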
Looking at the text file, each line seems to follow the same format. As such, why not just split it like this:
for line in lines:
    id, nm, lat, lon, code = line.split()
    # Insert into SQLite db
With split() you don't have to worry about how much whitespace there is between each token of the string.
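For completeness, a minimal sketch of the full loop, assuming a table named places with matching columns already exists (the table name, file path and db path are placeholders):

import sqlite3

con = sqlite3.connect('path/new.db')
cur = con.cursor()
with open('file_path') as f:
    next(f)  # skip the header line
    for line in f:
        id_, nm, lat, lon, code = line.split()
        cur.execute('INSERT INTO places VALUES (?, ?, ?, ?, ?)',
                    (int(id_), nm, float(lat), float(lon), code))
con.commit()
con.close()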
Using str.split:
txt = '5555555 London 55.876456 99.546231 UK'
(id, nm, lat, lon, countryCode) = txt.split()
So I'm working on taking a txt file and converting it into a csv data table.
I have managed to convert the data into a csv file and put it into a table, but I have a problem with extracting the numbers. In the data table that I made, it's giving me text as well as the value (intensity = 12345).
How do I only put the numerical values into the table?
I tried using regular expressions, but I couldn't get them to work. I would also like to delete all the lines that contain "saturated", "fragmented" or "merged". I initially wrote code that deletes every odd line, but this code will be used for several files, and the odd lines in other files might contain different data. How would I go about doing that?
This is the code that I currently have, plus a picture of what the output looks like.
import pandas as pd
parameters = pd.read_csv("ScanHeader1.txt", header=None)
parameters.columns = ['Packet Number', 'Intensity','Mass/Position']
parameters.to_csv('ScanHeader1.csv', index=None)
df = pd.read_csv('ScanHeader1.csv')
print(df)
I would really appreciate some tips or pointers on how I can do this. Thanks :)
You can try this:
def fun_eq(x):
    # e.g. 'intensity = 12345' -> '12345'
    x = x.split(' = ')
    return x[1]

def fun_hash(x):
    # e.g. 'packet # 42' -> '42' (assuming a ' # ' separator)
    x = x.split(' # ')
    return x[1]

df = df.iloc[::2]  # keep every other row, dropping the odd lines
df['Intensity'] = df['Intensity'].apply(fun_eq)
df['Mass/Position'] = df['Mass/Position'].apply(fun_eq)
df['Packet Number'] = df['Packet Number'].apply(fun_hash)
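Since the question also asks to drop the lines containing "saturated", "fragmented" and "merged" rather than simply every odd row, here is a hedged sketch using str.contains, run before the extraction steps above (I'm assuming those words show up in the Intensity column; adjust the column name to wherever they actually appear):

mask = df['Intensity'].str.contains('saturated|fragmented|merged', case=False, na=False)
df = df[~mask]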
I was provided with a JSON file which looks something like below when opened with Atom:
["[{\"column1\":value1,\"column2\":value2,\"column3\":value3 ...
I tried loading it in Jupyter with pandas read_json as such:
data = pd.read_json('filename.json', orient = 'records')
And when I print data.head(), it shows the result below:
screenshot of results
I have also tried the following:
import json
with open('filename.json', 'r') as file:
data = json.load(file)
When I check with type(data) I see that it is a list. When I check data[0][1], it returns {, i.e. it seems that the characters in the file have been loaded as a single string element in the list?
Just wondering if I am missing anything? I am expecting the JSON file to be loaded as a dataframe so that I can analyze the data inside. Appreciate any guidance and advice. Thanks in advance!
OK, since head() shows only one entry, I think the outer brackets are not needed. I would try to read your file as a string and change the string into something that pd.read_json() can parse. I assume that your file contains data in a form like this:
["[{\"column1\":2,\"column2\":\"value2\",\"column3\":4}, {\"column1\":4,\"column2\":\"value2\",\"column3\":8}]"]
Now I would read it, strip a trailing \n if it exists, remove the escape backslashes, and then strip the [" and "] from the string with this code:
with open('input.json', 'r') as file:
    data = file.read().rstrip()
cleaned_string = data.replace('\\', '')[2:-2]
The result is now a valid json string that looks like this:
'[{"column1":2,"column2":"value2","column3":4}, {"column1":4,"column2":"value2","column3":8}]'
This string can now be easily read by pandas with this line:
pd.read_json(cleaned_string, orient = 'records')
Output:
column1 column2 column3
0 2 value2 4
1 4 value2 8
The specifics (e.g. the indices to remove unused characters) could be different for your string as I do not know your input. However, I think this approach allows you to read your data.
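An alternative sketch that avoids hand-tuned slicing, assuming the file really is a JSON array whose first element is itself a JSON string: let json.load unwrap the outer layer and hand the inner string to pandas.

import json
import pandas as pd

with open('input.json', 'r') as file:
    inner = json.load(file)[0]  # the embedded JSON string, unescaped by the parser
df = pd.read_json(inner, orient='records')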
Using Pandas, I'm trying to extract a value using its key, but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
dataframe from a csv file
However, when I tried to extract the value using a key (e.g. df["id"]), I got an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect one, but hasn't done so in your case (see the read_csv documentation for more on this). Because of that, the pandas dataframe has a single column, value, whose cells are entire lines from your file: the first entry is "{""id"":""1234"",""currency"":""USD""}". So the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to process each row manually, though there may be slicker options.
file = 'test.dat'
id_vals = []
currency = []
with open(file, 'r') as f:
    for line in f.readlines()[1:]:  # skip the 'value' header row
        ## remove obfuscating characters
        for c in '"{}\n':
            line = line.replace(c, '')
        line = line.split(',')
        ## extract values to two lists ('id:' is 3 chars, 'currency:' is 9)
        id_vals.append(line[0][3:])
        currency.append(line[1][9:])
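From there, a small sketch to get back to a dataframe (the column names are my choice):

import pandas as pd

df = pd.DataFrame({'id': id_vals, 'currency': currency})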
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# Create pandas dataframe
df = pd.read_csv(new_csv_path, sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to parse the value in each row of your dataframe using json.loads() (or eval(), though json.loads() is safer),
something like this:
import json

# iteritems() iterates over columns; itertuples() gives one row at a time
for row in df.itertuples():
    print(json.loads(row.value)["id"])
    # OR
    print(eval(row.value)["id"])
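A more pandas-native sketch, assuming the column is named value as in the question: parse every cell, then expand the resulting dicts into columns with json_normalize.

import json
import pandas as pd

parsed = pd.json_normalize(df['value'].apply(json.loads).tolist())
print(parsed['id'])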
I'm having issues splitting a CSV file through PySpark. I'm trying to output the country and name of the wine (this is just to prove the parsing is working), but I get an error.
This is how the CSV file looks:
,country,description,designation,points,price,province,region_1,region_2,variety,winery
20,US,"Heitz has made this stellar rosé from the rare Grignolino grape since 1961. Ruby grapefruit-red, it's sultry with strawberry, watermelon, orange zest and salty spice flavor, highlighted with vibrant floral aromas.",Grignolino,95,24.0,California,Napa Valley,Napa,Rosé,Heitz
and here is my code
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("SQLProject")
sc = SparkContext(conf = conf)
def parseLine(line):
fields = line.split(',')
country = fields[1]
points = fields[4]
return country, points
lines = sc.textFile("file:///Users/luisguillermo/IE/Spark/Final Project/wine-reviews/winemag-data-130k-v2.csv")
rdd = lines.map(parseLine)
results = rdd.collect()
for result in results:
print(result)
And get this error:
File "/Users/luisguillermo/IE/Spark/Final Project/wine-reviews/country_and_points.py", line 10, in parseLine
points = fields[4]
IndexError: list index out of range
It appears that the program gets confused as there are commas in the description. Any ideas on how to fix this?
I would recommend using Spark's built-in CSV data source, as it provides many options, including quote, which is used to read columns that contain the delimiter. Of course, a column containing the delimiter has to be quoted with some character in the file.
quote
When you have a column containing the character used to split the columns, use the quote option to specify the quote character; by default it is " and delimiters inside quotes are ignored, but with this option you can set any character.
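For example, a minimal sketch for the file in the question (the path is shortened, and the escape setting is an assumption to handle embedded quotes):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLProject").getOrCreate()
df = (spark.read
      .option("header", True)
      .option("quote", '"')
      .option("escape", '"')
      .csv("winemag-data-130k-v2.csv"))
df.select("country", "points").show(5)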
If you want to read about the other options Spark CSV provides, along with examples, I would suggest reading the following articles:
spark-read-csv-file-into-dataframe
read-csv
Happy Learning !!
See this code:
df = spark.read.csv('data.csv', header=True)
df.printSchema()
df.show()
With header=True, the resulting df is a DataFrame whose columns are named just like in the CSV's header row.
See more advanced features here.
I want to delimit some data from a txt file into a dataframe, but when I open this file via the pandas module, the data has just 1 column. I want to delimit this data into 17 columns. The data from the txt file looks like:
In python, I have the following code using pandas:
import pandas as pd
count = 1
nama = 'Data/' + '%d.txt' % (count)
df = pd.read_table(nama, sep='\t', header=None)
df_head1 = df
df_sta = df
data_sta = df_sta.drop([0, 1, 2, 3, 4, 5])
print(data_sta)
I need to split it into columns like sta, date, time, Latitude, Longitude, and sta time. If I delimit it in Excel, the data I want looks like:
The data I want
PS: I have used delim_whitespace=True, but that doesn't run and the message is:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 6, saw 4
If delimiting by tab ('\t') doesn't work, use:
df = pd.read_table(filename, delim_whitespace=True, skiprows=n)  # n = number of rows before the header starts
This is similar to using delimiter=r"\s+", but I feel the above method is faster than the regex.
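For example, applied to the question's file, assuming the six rows dropped by data_sta.drop([0,1,2,3,4,5]) are the preamble before the header (note that pandas 2.2+ deprecates delim_whitespace in favor of sep=r'\s+'):

import pandas as pd

df = pd.read_table('Data/1.txt', delim_whitespace=True, skiprows=6)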