How can I have pandas recognize the structure of my data properly?

I have some data saved in ".txt" files. This is how they are stored:
I used the code below to read the data and save it in a DataFrame object (I'm using the pandas library in Python):
new_df = pd.read_csv(location, sep='\t', lineterminator='\n', names=None)
The problem is that when I check the shape of my DataFrame with new_df.shape, I end up with (123, 1). It does not recognize that the data have 4 columns. How can I fix this?

It seems your fields are separated by spaces rather than tabs - use sep="\s+".
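For example, a minimal sketch based on the question's code (location is the same file path as above):

import pandas as pd

# sep=r"\s+" splits on any run of whitespace (spaces or tabs).
new_df = pd.read_csv(location, sep=r"\s+")
print(new_df.shape)  # should now report 4 columns if the file is space-separated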

From your screenshot, your data appear to be in fixed width format.
Try to use pandas.read_fwf to read your data file:
pd.read_fwf(location)
You may pass the colspecs=... argument to tell it where each column starts and ends, but the routine is usually smart enough to figure this out automagically.
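For example, a sketch with explicit column boundaries (the character ranges here are purely illustrative; measure them from your own file):

import pandas as pd

# Each (start, end) pair is the half-open character range of one column.
colspecs = [(0, 8), (8, 16), (16, 24), (24, 32)]
df = pd.read_fwf(location, colspecs=colspecs)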

How to separately add a header row while loading a parquet file?

While handling csv files we can say:
df = pd.read_csv("test.csv", names=header_list, dtype=dtype_dict)
The above creates a DataFrame with headers taken from header_list and dtypes from dtype_dict.
Can we do something similar with pd.read_parquet() ?
My issue involves passing in the headers separately, so they would not be available in "test.csv" itself.
Another way to bypass this could be to shift the entire data in df down by one row (turning the existing header into a data row) and then replace the header with header_list (if that's even possible?).
Is there an optimal solution to my issue?
I'm not too familiar with parquet so any guidance would be appreciated, thanks.
Can we do something similar with pd.read_parquet() ?
Parquet files contain metadata, including the column names and their types, so there is no need to pass this information when loading the data.
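If you still want to impose your own names and types after loading, a minimal sketch (header_list and dtype_dict are the variables from the question; the filename is illustrative):

import pandas as pd

df = pd.read_parquet("test.parquet")

# Override the stored column names, then cast the dtypes.
# dtype_dict must be keyed by the new names in header_list.
df.columns = header_list
df = df.astype(dtype_dict)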

Saving columns from csv

I am trying to write code that reads a CSV file and saves each column as a specific variable. I am having difficulty because the header is 7 lines long (something I can control, but would like to just ignore if I can handle it in code), and my data is full of important decimal places, so it cannot be converted to int (or maybe string?). I've also tried saving each column by its position in the file, but am struggling to get that running. Any ideas?
The image shows my current code, slimmed down to the important parts, and circles the data that prints in my console.
save each column as a specific variable
import pandas as pd

df = pd.read_csv('file.csv')
x_col = df['X']
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
If what you are looking for is how to iterate through the columns, no matter how many there are (which is what I think you are asking), then this code should do the trick:
import pandas as pd

# skiprows=6 skips the first six lines of the 7-line header;
# the seventh line then becomes the column names.
data = pd.read_csv('optitest.csv', skiprows=6)

for column in data.columns:
    # You will need to define what this save() function is;
    # it is just placed here as an example.
    save(data[column])
The part about formatting your data as a number or a string was a little vague, but if it's decimal data, then you need float. See #9637665.
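For example, a sketch that forces float at read time (the filename and skiprows come from the snippet above; this assumes every column is numeric):

import pandas as pd

# dtype=float keeps the decimal places on every column.
data = pd.read_csv('optitest.csv', skiprows=6, dtype=float)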

Prevent New Line Formation in json_normalize Data Frames

I am working to flatten some tweets into a wide data frame. I simply use the pandas.json_normalize function on my tweets to perform this.
I then save this data frame into a CSV file. When uploaded, the CSV produces some rows that belong with the record above, rather than holding all the data for a record on a single row. I discovered this issue when uploading the CSV into R and into Domo.
When I run the following command in a Jupyter notebook, the CSV loads fine:
sb_2019 = pd.read_csv('flat_tweets.csv',lineterminator='\n',low_memory=False)
Without the lineterminator I see this error:
Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Needs:
I am looking for a post-processing step to eliminate the need for the lineterminator, as I need to open the CSV in platforms and languages that do not have this option. How might I go about doing this?
Note:
I am working with over 700k tweets. The json_normalize function works great on small pieces of my data; it is when I run it on the whole dataset that I find this issue.
Try using '\r\n' or '\r' as lineterminator, and not '\n'.
This solution would be helpful too, opening in universal-new-line mode:
sb_2019 = pd.read_csv(open('flat_tweets.csv','rU'), encoding='utf-8', low_memory=False)
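If you would rather fix the file itself, here is a post-processing sketch that strips embedded newlines from text fields before writing, so other tools can read the CSV without a special line terminator (the tweets variable and filename are illustrative):

import re
import pandas as pd

flat = pd.json_normalize(tweets)

# Replace CR/LF characters embedded inside string fields with a space,
# so each record stays on a single physical line in the CSV.
def strip_newlines(value):
    return re.sub(r'[\r\n]+', ' ', value) if isinstance(value, str) else value

for col in flat.select_dtypes(include='object').columns:
    flat[col] = flat[col].map(strip_newlines)

flat.to_csv('flat_tweets.csv', index=False)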

Creating a dataframe from a csv file in pandas: column issue

I have a messy text file that I need to sort into columns in a dataframe so I can do the data analysis I need to do. Here is the messy-looking file:
Messy text
I can read it in as a csv file, that looks a bit nicer using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
This prints out the data aligned, but the output is [640 rows x 1 column], and I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
However, when I do this, there is still an issue.

Try passing delim_whitespace=True (see the pandas read_csv docs):
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
Your input file is actually not in CSV format.
Since you provided only a .png picture, it is not even clear whether this file is divided into rows at all.
If it is not, you have to start by "cutting" the content into individual lines and then read the result of this cutting.
I think this is the first step before you can use either read_csv or read_table (with delim_whitespace=True, of course).
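A minimal sketch of that first step, assuming the file does contain newline-separated rows of whitespace-separated values (the filename comes from the question):

import pandas as pd
from io import StringIO

with open('phx_30kV_indepth_0_0_outfile.txt') as f:
    # Drop blank lines, then let pandas split each remaining row on whitespace.
    lines = [line for line in f.read().splitlines() if line.strip()]

df = pd.read_csv(StringIO('\n'.join(lines)), delim_whitespace=True)
print(df.shape)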

Export Python dataframe into CSV

I have a DataFrame x of shape (2000, 3000). I would like to export it to CSV to use in R. I tried this code:
x.to_csv("ab.csv", sep='\t')
However, when I open in R by the code:
data = read.csv(".data/ab.csv")
The size of data is (2000, 1), because the CSV is not being split into 3000 columns. Is there any way to keep the same shape after exporting?
By using the parameter sep='\t' you have written a "CSV" which uses tab to separate fields instead of a comma. You could either remove the parameter and write a normal CSV, or use the sep="\t" argument for read.csv in R. If there's no reason to use tab then I would suggest the former option.
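For example, with the first option:

# Python: write a plain comma-separated CSV; index=False avoids an
# extra unnamed index column appearing in R.
x.to_csv("ab.csv", index=False)

# Then in R: data <- read.csv("ab.csv") should pick up all 3000 columns.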
Try reading the CSV file like this:
data = read.csv(".data/ab.csv", sep="\t")
Your CSV uses \t to separate values; with the sep parameter you specify the separator when opening it.
