I'm trying to write the keys and values of a dictionary to a .csv file for the first time. I'd like to have two columns: A) keys and B) values.
I used NLTK to get the bigrams from a large text file:
frequency = nltk.FreqDist(bigrams)
The dictionary looks like this:
('word1', 'word2'),1
I'd like to write the most common bigrams, say the top 100, into the .csv file. My code looks like this:
import pandas as pd
import csv
common=frequency.most_common(100)
pd.DataFrame(common, columns=['word', 'count']).to_csv("output.csv", index=False)
However, the results are not being written in two different columns (A and B); they all end up in the same column (column A).
How can I fix this? Thanks!
I tried this out with random data; maybe this will help:
import pandas as pd

common = {}
common[('panda', 'white')] = 2

x = pd.DataFrame(common.items(), columns=['word', 'count'])
x.head()
x.to_csv("output.csv", index=False)
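With this, the bigram tuple should land in column A and the count in column B; pandas writes the tuple as its string representation, so the file should look roughly like:

word,count
"('panda', 'white')",2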
I'm trying to create a function in Python pandas that:
reads 5 csv files
makes some aggregations on each csv read (to keep it simple, we can just delete one column)
saves each modified csv as a separate DataFrame
Currently I have something like the code below; nevertheless, it returns only one DataFrame as output, not 5. How can I change the code below?
def xx():
    #1. read 5 csv
    for el in [col for col in os.listdir("mypath") if col.endswith(".csv")]:
        df = pd.read_csv(f"mypath/{el}")
        #2. making aggregations
        df = df.drop("COL1", axis=1)
        #3. saving each modified csv to separated DataFrames
        ?????
Finally, I need to have 5 separate DataFrames after the modifications. How can I modify my function to achieve that in Python pandas?
You can create an empty dictionary and fill it gradually with the five processed DataFrames.
Try this:
import os
import pandas as pd

def xx():
    dico_dfs = {}
    for el in [file for file in os.listdir("mypath") if file.endswith(".csv")]:
        #1. read 5 csv
        df = pd.read_csv(f"mypath/{el}")
        #2. making aggregations
        df = df.drop("COL1", axis=1)
        #3. saving each modified csv to separated DataFrames
        dico_dfs[el] = df
    return dico_dfs
You can access each DataFrame by using its filename as a key, e.g. dico_dfs["file1.csv"].
If needed, you can make a single DataFrame by using pandas.concat: pd.concat(dico_dfs).
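As a minimal sketch of that last step (assuming xx() returns the dictionary, as above):

dico_dfs = xx()

# the filenames become the outer level of a MultiIndex on the rows
combined = pd.concat(dico_dfs)
print(combined.head())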
I have one Excel file with a large amount of data. I want to split it into multiple Excel files with an equal distribution of rows.
My current code is partially working: it distributes the required number of rows and creates multiple Excel files, but it also keeps creating extra files beyond the rows available.
If I put 5 in n_partitions, it creates two Excel files with 5 rows each, and after that it keeps creating three more blank files.
I want my code to stop creating files once all the rows have been distributed.
Below are my sample data with the expected result, and the code I am currently using:
import pandas as pd

df = pd.read_excel("C:/Zen/TestZenAmp.xlsx")
n_partitions = 5

for i in range(n_partitions):
    sub_df = df.iloc[(i*n_partitions):((i+1)*n_partitions)]
    sub_df.to_excel(f"C:/Zen/-{i}.xlsx", sheet_name="a")
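The root cause is that n_partitions is doing double duty as both the number of files and the chunk size. A minimal fix, as a sketch: keep a separate chunk size and derive the number of files from the row count, so the loop stops when the rows run out (paths as in the question):

import math
import pandas as pd

df = pd.read_excel("C:/Zen/TestZenAmp.xlsx")

chunk_size = 5  # rows per output file
n_files = math.ceil(len(df) / chunk_size)  # only as many files as there are rows for

for i in range(n_files):
    sub_df = df.iloc[i * chunk_size:(i + 1) * chunk_size]
    sub_df.to_excel(f"C:/Zen/-{i}.xlsx", sheet_name="a", index=False)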
Another possible solution:
import pandas as pd

df = pd.read_excel("x.xlsx")

k = 5
g = df.groupby(df.index // k)
df['id'] = g.ngroup()

(g.apply(lambda x: x.drop('id', axis=1)
 .to_excel(f"/tmp/x-{pd.unique(x.id)[0]}.xlsx", sheet_name="a")))
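To see why df.index // k defines the chunks, here is a tiny illustration with a hypothetical 4-row frame and k = 2:

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40]})

# integer division maps row positions 0,1 -> 0 and 2,3 -> 1,
# so every k consecutive rows share a group label
print(df.index // 2)  # Index([0, 0, 1, 1]) (repr varies by pandas version)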
You can use the code below to split your DataFrame into chunks of 5 rows:
n = 5
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]
You can access every chunk like this:
>>> list_df[0]
>>> list_df[2]
Then you can loop through the list of chunks/sub-dataframes and create separate Excel files:
for i, sub_df in enumerate(list_df, start=1):
    sub_df.to_excel(f"C:/Zen/-{i}.xlsx", sheet_name="a", index=False)
I have sentences in a csv that I'd like to split on a delimiter (spaces).
I've tried using this:
df2 = df["Text"].str.split()
but it doesn't give the ideal result; it shows up like this instead.
I know how to do this via Power Query in Excel, but I would like to learn how to make a similar move using Python.
Here's the ideal result that I'd like to achieve:
Try this:
df2 = df["Text"].str.split(',', expand=True)
The problem with doing this is that the number of output columns gets fixed by the longest sentence. Having said that, you could try the following code:
import pandas as pd
final_df = original_df['Sentence'].str.split(',', expand=True)
final_df = final_df.add_prefix('Text.')
Note that empty cells will be filled with None. If you want them to look like empty entries, you could add the following line, which replaces every None with an empty string:
final_df = final_df.replace([None], [''])
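A quick end-to-end demo with made-up data (the column name Sentence and the comma-separated text are assumptions):

import pandas as pd

original_df = pd.DataFrame({'Sentence': ['a,b,c', 'd,e']})

final_df = original_df['Sentence'].str.split(',', expand=True)
final_df = final_df.add_prefix('Text.')
final_df = final_df.replace([None], [''])

print(final_df)
#   Text.0 Text.1 Text.2
# 0      a      b      c
# 1      d      e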
Hope this will be useful.
My CSV file looks something like this:
Version,1,2,3,4,5
Letter,A,B,C,D,E
Version and Letter are both keys I want in the dictionary, and I read the file in using index_col=0. However, I want to save these values into a dictionary like this:
{Version: ["1,2,3,4,5"], Letter: ["A,B,C,D,E"]}
How would I do this using pandas?
Thanks so much for your help.
Since you have the keys in rows, one simple way is to transpose the csv and then convert the columns to lists, followed by zip and conversion to a dictionary:
import pandas

df = pandas.read_csv("filename.csv").transpose()
dictionary = dict(zip(list(df.column_name1), list(df.column_name2)))
Another efficient way, if the rows have too many values to type out each column name, is:
df = pandas.read_csv("filename.csv").transpose().to_dict()
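For the exact CSV shown above, a concrete sketch (header=None and index_col=0 are assumptions about how the file should be read):

import pandas as pd

# treat the first row as data and the first column as row labels
df = pd.read_csv("filename.csv", header=None, index_col=0)

# transpose so "Version" and "Letter" become columns, then dump each column to a list
result = df.transpose().to_dict(orient="list")
# {'Version': ['1', '2', '3', '4', '5'], 'Letter': ['A', 'B', 'C', 'D', 'E']}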
I have a dataset that I want to move to Spark SQL. This dataset has about 200 columns. The best way I have found to do this is mapping the data to a dictionary and then moving that dictionary to a Spark SQL table.
The problem is that if I move it to a dictionary, the code will be super hacky and not robust. I will probably have to write something like this:
lines = sc.textFile(file_loc)
#parse commas
parts = lines.map(lambda l: l.split(","))
#split data into columns
columns = parts.map(lambda p: {'col1': p[0], 'col2': p[1], 'col3': p[2], 'col4': p[3], 'col5': p[4], 'col6': p[5], 'col7': p[6], 'col8': p[7], 'col9': p[8], 'col10': p[9], 'col11': p[10], 'col12': p[11], 'col13': p[12]})
I only did 13 columns since I didn't feel like typing more than this, but you get the idea.
I would like to do something similar to how you read a csv into a data frame in R, where you specify the column names in a variable and then use that variable to name all the columns.
example:
col_names <- c('col0','col1','col2','col3','col4','col5','col6','col7','col8','col9','col10','col11','col12','col13')
df <- read.csv(file_loc, header=FALSE, col.names=col_names)
I cannot use a pandas data frame since the data structure is not available for use in spark at the moment.
Is there a way to create a dictionary in python similar to the way you create a data frame in R?
zip might help.
dict(zip(col_names, p))
You can use itertools.izip (on Python 2) if you're concerned about the extra memory for the intermediate list.
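Putting it together with the question's variables (sc and file_loc as in the question; the 13-name range is an assumption):

# build the names once instead of typing them out
col_names = ['col{}'.format(i) for i in range(1, 14)]

lines = sc.textFile(file_loc)
parts = lines.map(lambda l: l.split(","))

# zip pairs each name with the field at the same position in the row
columns = parts.map(lambda p: dict(zip(col_names, p)))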