How to split column without delimiter pandas - python

I have a CSV file where the first column is a date in the format YYYYMMDD. I don't need the YYYY part; I would like to delete it and keep only MMDD, but there is no delimiter between the two. I've tried a couple of things, and the only one that worked is a method that loops through each row and deletes the year, which takes ages for my file with more than a million rows.
This is my loop, but I can't seem to find a way to do it for all rows in one go.
def drop_year(row):
    print(row[0])
    data.iloc[row[0]] = str(row[0])[4:]

[drop_year(row) for row in data.iterrows()]

I found a solution, thank you all for your help.
data["YYYYMMDD"] = data["YYYYMMDD"].astype(str).str[4:]
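For anyone who wants to try the vectorized approach in isolation, here is a minimal sketch on a small made-up frame. The sample values, the extra value column, and the new MMDD column name are assumptions for illustration only. Casting to str first matters when the column was read as integers, and slicing the string keeps leading zeros such as "0131".

```python
import pandas as pd

# Made-up sample data; only the YYYYMMDD column name comes from the question.
data = pd.DataFrame(
    {"YYYYMMDD": [20180131, 20191225, 20200704], "value": [1.0, 2.0, 3.0]}
)

# Convert to string once, then slice every element from position 4 onward.
# This runs on the whole column at once instead of looping over rows.
data["MMDD"] = data["YYYYMMDD"].astype(str).str[4:]

print(data)
```

Writing the result back into the original column instead of a new one works the same way.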

Related

Create a new dataset after selection in Python

Total newbie with Python, and I'm trying to learn "in the field".
So basically I managed to open a csv file, pick only the rows that have certain values in specific columns, and then print the rows.
What I'd love to do after this is get a random selection of one of the found rows.
I thought I would do that by first creating a new csv file, which at this point will only contain the filtered rows, and then randomly selecting from it.
Any ideas on the simplest way to do that?
Here's the portion of the code so far:
import csv

with open("top2018.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        if (row[4] >= "0.8") and (row[6] <= "-4") and (row[12] >= "0.8"):
            print(row[2] + " -", row[1])
It will find 2 rows (I checked).
And then, for creating a new csv file:
import pandas as pd
artist = [row[2]]
name = [row[1]]
dict = {'artist': artist, 'name': name}
df = pd.DataFrame(dict)
df.to_csv('test.csv')
But I don't know why, with this method, the new csv file has only 1 entry, while I'd want to have all of the found rows in it.
Hope something I wrote makes sense!
Thanks guys!
You are mixing up columns and rows; renaming the variable row to record might make it easier to see what is happening. Unfortunately, I have to guess what the data file looks like...
The dict variable (try not to use this name: dict is a built-in type and you don't want to shadow it) creates two columns, "artist" and "name", each holding a one-element list such as [1.2], because row refers to only one record at that point. So dict (try printing it) could look like {"artist": [2.0], "name": [3.1]}, which is a single-row, two-column table:
artist name
2.0 3.1
Try to get into pandas: use df = pd.read_csv() and the df[df.something > 0.3] style of notation to filter tables. The csv package is better suited to truly tricky data wrangling.
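The pandas advice above could look roughly like the sketch below. The column names (energy, loudness, valence, artists, name) and the sample rows are assumptions, since the real layout of top2018.csv is not shown in the question. The thresholds are the ones from the question, but compared as numbers rather than as strings, which is a deliberate change because string comparison orders values character by character.

```python
import io
import pandas as pd

# Stand-in for top2018.csv; column names and values are made up for illustration.
csv_text = """name,artists,energy,loudness,valence
Song A,Artist 1,0.85,-4.5,0.90
Song B,Artist 2,0.60,-3.0,0.40
Song C,Artist 3,0.92,-5.1,0.81
Song D,Artist 4,0.81,-2.0,0.95
"""
df = pd.read_csv(io.StringIO(csv_text))

# Boolean mask: every condition is evaluated for all rows at once.
mask = (df["energy"] >= 0.8) & (df["loudness"] <= -4) & (df["valence"] >= 0.8)
matches = df[mask]

# All matching rows end up in the output, not just the last one.
# Pass a path such as "test.csv" to to_csv to write a file instead of a string.
filtered_csv = matches[["artists", "name"]].to_csv(index=False)

# Pick one of the matching rows at random; no intermediate file is needed.
pick = matches.sample(n=1)
print(pick["artists"].iloc[0], "-", pick["name"].iloc[0])
```

Since sample works directly on the filtered frame, writing a new csv file first is optional.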

How to import CSV data in a single column dataframe?

I know where this error is coming from.
I try to run df = pd.read_csv("\words.csv")
In this CSV, I have a single column, and each row is filled with text.
Sometimes this text contains the separator itself, a comma (,).
But I have practically all the possible symbols so I can't give it a good separator! (I have ; too)
The only thing that I know is that I only need 1 column. Is there a way to force the number of columns and not interpret the others "separators"?
Since you are aiming to have one column, one way to achieve this goal is to use a newline \n as a separator like so:
import pandas as pd
df = pd.read_csv("\words.csv", sep="\n")
Since there will always be one line per row it is bound to always detect one column.
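Newer pandas releases may reject a newline as the separator with a ValueError, so it is worth checking against the installed version. A sketch that sidesteps the CSV parser entirely is to read the lines with plain Python and build the one-column frame by hand. The column name text and the sample content are assumptions for illustration.

```python
import io
import pandas as pd

# Stand-in for the file; with a real file, replace this with
# open("words.csv", encoding="utf-8") used as a context manager.
raw = io.StringIO("hello, world\nfoo; bar, baz\nplain text\n")

# One list element per line, with the trailing newline removed.
# Commas and semicolons inside a line are left untouched.
lines = [line.rstrip("\n") for line in raw]

df = pd.DataFrame({"text": lines})
print(df)
```

Unlike read_csv, this treats a first line that happens to be a header as ordinary data, so drop it explicitly if the file has one.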

Ask Pandas to delete all rows beneath a certain row

I have imported an Excel file as a dataframe using pandas.
I now need to delete all rows from row 41,504 (index 41,505) and below.
I have tried df.drop(df.index[41504]), although that only catches the one row. How do I tell Pandas to delete onwards from that row?
I did not want to delete by an index range as the dataset has tens of thousands of rows, and I would prefer not to scroll through the whole thing.
Thank you for your help.
Kind regards
df.drop(df.index[41504:])
Drop the remaining range. If you don't mind creating a new df, then use a filter, keeping rows [:41504].
You can reassign the range you do want back into the variable instead of removing the range you do not want.
You can just take the first rows that you need and ignore all the rest:
result = df[:41504]
df = df.iloc[:41504]
just another way
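The approaches above should give the same result. A small sketch on a ten-row frame, where the cutoff of 6 stands in for 41504 (both the frame and the cutoff are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})
cutoff = 6  # stand-in for 41504

# Option 1: drop every label from the cutoff position onward.
dropped = df.drop(df.index[cutoff:])

# Option 2: keep only the first `cutoff` rows by position.
sliced = df.iloc[:cutoff]

print(dropped.equals(sliced))
```

Note that drop returns a new frame rather than changing df in place, so the result has to be assigned back to a variable.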

Python, Pandas: How to automatically skip Excel header cells and add the rest to a dataframe

Greetings, I would like to transform an Excel document into a dataframe, but unfortunately the Excel documents are made by someone else and they will always have headers like so:
Excel example
I would like to ignore the "made by stevens" and "made 04/02/21" parts and just read the relevant information like name, age, file.
How would I skip it using pandas?
Is there a way to always skip that header information, even if the relevant info (name, age, file) starts at a different line in different documents? (I.e., in one document age is at row 4 and in another age is at row 7.)
Thanks!
The function pandas.read_excel has a parameter called skiprows. If you feed it an integer, it will simply skip the first n lines at the start of the file.
In your case just use:
df = pd.read_excel(filepath, skiprows=4)
The second part of your question is trickier. Depending on your business use cases you might have different solutions. If the columns are always the same (Name, Age, file), you could import the Excel file without skipping lines but with fixed column names, and then drop the rows with empty data and the additional header row you didn't use.
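One way to sketch that idea is to read the sheet without a header, search for the first row that contains all the expected column names, and promote that row to the header. The helper name promote_header_row and the sample frame are hypothetical; the frame stands in for what pd.read_excel(filepath, header=None) might return for a sheet like the one described.

```python
import pandas as pd

def promote_header_row(raw, expected_columns):
    """Return the rows below the first row holding all expected column names."""
    expected = {name.lower() for name in expected_columns}
    for position in range(len(raw)):
        values = raw.iloc[position].tolist()
        cells = {str(v).strip().lower() for v in values if pd.notna(v)}
        if expected <= cells:
            # Everything below the header row becomes the data.
            table = raw.iloc[position + 1:].reset_index(drop=True)
            table.columns = [str(v).strip() for v in values]
            return table
    raise ValueError("no row contains all expected column names")

# Hypothetical stand-in for pd.read_excel(filepath, header=None).
raw = pd.DataFrame(
    [
        ["made by stevens", None, None],
        ["made 04/02/21", None, None],
        [None, None, None],
        ["name", "age", "file"],
        ["alice", 30, "a.txt"],
        ["bob", 25, "b.txt"],
    ]
)

table = promote_header_row(raw, ["name", "age", "file"])
print(table)
```

Because the header row is found by content rather than by position, the same call works whether the table starts at row 4 or row 7. The columns come back with object dtype, so numeric columns may need converting afterwards.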
If you want to skip a header that is on row 1, then you can try this:
pandas.read_excel(filepath, skiprows=1, skipfooter=0)
You can pass an integer to skiprows (for example skiprows=1) to skip header rows and to skipfooter (for example skipfooter=1) to skip footer rows; the numbers depend on how many rows you want to skip.

How to prevent truncation using pd.read_sas()

I am new to Python and use the following code to read in a SAS file:
df=pd.read_sas('C:\\test\\test.sas7bdat', format = 'sas7bdat', encoding = 'latin-1')
Some columns contain either a 7-character code or just "M" for missing. In columns where the first couple of rows contain only "M" and the 7-character codes appear only in later rows, the values are truncated to a single character for all rows. This does not happen in columns that have a 7-character code in the first rows.
This is what the original data looks like in SAS.
How can I prevent pandas from truncating the text when reading in the data?
Thank you.
Lia
