I have a dataset that I want to move to Spark SQL. This dataset has about 200 columns. The best way I have found to do this is mapping the data to a dictionary and then moving that dictionary to a Spark SQL table.
The problem is that if I move it to a dictionary, the code will be super hacky and not robust. I will probably have to write something like this:
lines = sc.textFile(file_loc)
#parse commas
parts = lines.map(lambda l: l.split(","))
#split data into columns
columns = parts.map(lambda p: {'col1': p[0], 'col2': p[1], 'col3': p[2], 'col4': p[3], 'col5': p[4], 'col6': p[5], 'col7': p[6], 'col8': p[7], 'col9': p[8], 'col10': p[9], 'col11': p[10], 'col12': p[11], 'col13': p[12]})
I only did 13 columns since I didn't feel like typing more than this, but you get the idea.
I would like to do something similar to how you read a csv into a data frame in R, where you store the column names in a variable and then use that variable to name all the columns.
example:
col_names <- c('col0','col1','col2','col3','col4','col5','col6','col7','col8','col9','col10','col11','col12','col13')
df <- read.csv(file_loc, header=FALSE, col.names=col_names)
I cannot use a pandas data frame since the data structure is not available for use in spark at the moment.
Is there a way to create a dictionary in python similar to the way you create a data frame in R?
zip might help.
dict(zip(col_names, p))
You can use itertools.izip (Python 2) if you're concerned about the extra memory for the intermediate list; in Python 3, zip is already lazy.
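For completeness, here is a rough sketch of how the zipped dictionaries could feed a Spark SQL table, assuming a Spark 2+ session named spark, the existing sc and file_loc, and hypothetical column names; adjust for your version:
from pyspark.sql import Row

col_names = ['col' + str(i) for i in range(13)]  # placeholder; use your real ~200 names
lines = sc.textFile(file_loc)
parts = lines.map(lambda l: l.split(","))
# zip the names with the split values and build one Row per record
rows = parts.map(lambda p: Row(**dict(zip(col_names, p))))
df = spark.createDataFrame(rows)
df.createOrReplaceTempView("my_table")  # hypothetical table name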
I have a dataframe arguments with the columns RecordID (int) and additional_arguments (a JSON-formatted string). I am trying to convert each of the JSON strings into a dataframe and then concatenate them all into one dataframe. Currently, I am doing this with a for loop:
arguments_output = pd.DataFrame([])
for i in range(len(arguments)):
    df = pd.DataFrame(arguments['additional_arguments'][i])
    df['RecordID'] = arguments['RecordID'][i]
    arguments_output = pd.concat([arguments_output, df])
This takes quite a bit of time, as there are over 55000 records total. Is there a better way to achieve this? Thank you
Without seeing an example input it's hard to say, but pandas.json_normalize() can un-nest the JSON for you.
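As a rough sketch (assuming additional_arguments really holds one JSON-formatted string per row), parsing the strings first and calling pd.concat once at the end should be much faster than growing the dataframe inside the loop:
import json
import pandas as pd

# parse each JSON string once
parsed = arguments['additional_arguments'].apply(json.loads)

frames = []
for record_id, payload in zip(arguments['RecordID'], parsed):
    frame = pd.json_normalize(payload)  # pandas >= 1.0; older versions: pandas.io.json.json_normalize
    frame['RecordID'] = record_id
    frames.append(frame)

# one concat at the end instead of one per record
arguments_output = pd.concat(frames, ignore_index=True)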
I have been trying to find an approach that lets me load only those columns from a csv file that satisfy a certain condition while I create the DataFrame, i.e. something that can skip the unwanted columns, because I have a large number of columns and only some are actually useful for testing purposes. I also want to load only those columns whose mean is > 0.0. The idea is like skipping a certain number of rows or reading only the first nrows, but I am looking for condition-based filtering on column names and values.
Is this actually possible with pandas? Can it do things on the fly, accumulating results without loading everything into memory first?
There's no direct/easy way of doing that (that I know of)!
The first idea that comes to mind is to read just the first line of the csv (i.e. the headers) and then build a list of your desired columns with a list comprehension:
columnsOfInterest = [c for c in df.columns.tolist() if 'node' in c]
and get their positions in the csv. With that you have the columns/positions, so you can read only those from your csv.
However, for the second part of your condition, which needs the mean, you will unfortunately have to read all the data for those columns, run the mean calculation, and then keep only the ones of interest (where the mean is > 0). That's the limit of my knowledge; maybe someone else has a way of doing this and can help you out. Good luck!
I think usecols is what you are looking for.
df = pandas.read_csv('<file_path>', usecols=['col1', 'col2'])
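In recent pandas versions usecols also accepts a callable, which covers the name-based half of the condition directly; a small sketch, where 'node' is just a placeholder for your real naming rule:
import pandas as pd

# keep only columns whose name contains 'node'
df = pd.read_csv('<file_path>', usecols=lambda name: 'node' in name)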
You could preprocess the column headers using the csv library first.
import csv

with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    column_names = next(reader, None)
filtered_columns = [name for name in column_names if 'some_string' in name]
Then proceed using usecols from pandas as abhi mentioned.
df = pandas.read_csv('data.csv', usecols=filtered_columns)
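For the mean > 0.0 half of the question, the data for the name-filtered columns still has to be read once before the means can be computed; a minimal sketch:
import pandas

df = pandas.read_csv('data.csv', usecols=filtered_columns)
# keep only the numeric columns whose mean is greater than zero
numeric = df.select_dtypes(include='number')
df = numeric.loc[:, numeric.mean() > 0.0]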
I have a list of columns in a pandas DataFrame and am looking to create a list of certain columns without manual entry.
My issue is that I am learning and not knowledgeable enough yet.
I have tried searching around the internet but nothing quite matched my case. I apologize if this is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this scenario I am trying to select everything but 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
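If you want the actual sub-DataFrame rather than just the list of names, the same comprehension can go straight into the indexing brackets:
subset = dataframe[[col for col in dataframe.columns if col not in exclude_columns]]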
Let's say df is your dataframe. You can actually use filter and lambda, though it quickly becomes rather long. I present this as a "one-liner" alternative to the answer of #gmds.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
What's happening here:
filter applies a function to a list and keeps only the elements for which that function returns True.
We defined that function with lambda to check that neither 'air_pollution_score' nor 'greenhouse_gas_score' appears in the column name.
We're filtering the df.columns.values list, so the resulting list only retains the columns that aren't the ones we mentioned.
We're using the df[['column1', 'column2']] syntax, which means "make a new dataframe containing only the columns I list."
Simple solution with pandas:
import pandas as pd

data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3']]  # list every column you want to keep
Note: data is the source you have already loaded with pandas; the selected columns are stored in a new data frame df.
I am creating test and training data for an algorithm. I have data in different csv files and I want to create the training and test data from them.
I have imported all the csv files into pandas dataframes using
dfs = [pd.read_csv(file) for file in datafiles]
so dfs[0] holds the first dataframe, dfs[1] the second, and so on.
I would like to assign each of them to its own data frame, named Xtest1 for the first, Xtest2 for the second, and so on until the end of the files.
Can anyone help me do this with a loop or any other idea?
You mean automatically create variables and assign something to them?
try globals() or locals(), e.g.
for a in range(10):
    locals()["var1_" + str(a)] = 1
You need to use a dictionary to do this. Can you try the following:
dfs = {'Xtest'+ str(ind): pd.read_csv(file) for ind, file in enumerate(datafiles)}
And whenever you need to access the dataframe, you can do it the following way:
dfs['Xtest1']
If you want to iterate the dictionary you can do using the following:
for i in range(4):
    print(dfs['Xtest' + str(i)])
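If the number of files isn't fixed, you can also iterate over the dictionary directly instead of hard-coding the range:
for name, frame in dfs.items():
    print(name, frame.shape)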
I have a raw data table with over 500 columns that I am importing to a different database. Most of these columns are null (for example: session1, session2, session3 ~ session120). I didn't design this table, but there are 3 column types with over 100 columns each. Most would not need to be used unless it was for some very specific analysis or investigation (if ever).
Is there a nice way to combine these columns into a consolidated column which can be 'unpacked' later? I don't want to lose the information in case there is something important.
Here is my naive approach (using pandas to modify the raw data before inserting it into postgres):
column_list = []
for val in range(10, 121):  # session10 through session120
    column_list.append('session' + str(val))

# join the sparse session columns into one comma-separated string column
df['session_10_to_120'] = df[column_list].astype(str).agg(','.join, axis=1)
df.drop(columns=column_list, inplace=True)
I don't want to mess up my COPY statements to postgres (where it might think that the commas are separate columns).
Any recommendations? What is the best practice here?
It depends on what you want to do with these columns, but options include:
arrays
non-relational storage: hstore, json, xml (a sketch of the json option follows below)
turning the columns into rows in another table
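As a rough sketch of the json option (reusing the df and column_list from the question; the new column name is just a placeholder), the sparse columns can be packed into one JSON string per row before the COPY and unpacked later in postgres with ->> or jsonb_each:
# pack the rarely-used session columns into a single JSON string per row
df['session_extra'] = df[column_list].apply(lambda row: row.dropna().to_json(), axis=1)
df = df.drop(columns=column_list)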