Python data science cheatsheet
Pandas
import pandas as pd
Load CSV with ‘;’ separator
df = pd.read_csv(“data/entryRequests.csv”, sep = “;”, parse_dates=[[“Date”, “Time”]])
Print column names
print(df.columns)
Importing
pd.read_csv(filename) # From a CSV file
pd.read_table(filename) # From a delimited text file (like TSV)
pd.read_excel(filename) # From an Excel file
pd.read_sql(query, connection_object) # Read from a SQL table/database
pd.read_json(json_string) # Read from a JSON formatted string, URL or file.
pd.read_html(url) # Parses an html URL, string or file and extracts tables to a list of dataframes
pd.read_clipboard() # Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) # From a dict, keys for columns names, values for data as lists
From CSV
Parsing date and time columns. A new column Date_Time is created.
df = pd.read_csv("data/entryRequests.csv", sep = ";", parse_dates=[["Date", "Time"]])
Inspecting data
df.head(n) # First n rows of the DataFrame
df.tail(n) # Last n rows of the DataFrame
df.shape # Number of rows and columns
df.info() # Index, Datatype and Memory information
df.describe() # Summary statistics for numerical columns
s.value_counts(dropna=False) # View unique values and counts
df.apply(pd.Series.value_counts) # Unique values and counts for all columns
Other operations
Keep a subset of the columns
df_short = df[['color', 'food', 'score']]
Keep a subset of rows from i-th to n-th
df.iloc[ i:n, : ]
get the i-th row
df.iloc[ i, : ]
Delete i-th row
modDfObj = dfObj.drop(i)
Create time bins.
m/d/aaaa format (for more freq options check here
minute_bins = pd.date_range(start='1/1/2020', end='1/08/2020', freq='min')
Group by
If as_index=True (by default) then the column classroom would be converted into an index by the groupby, and wouldn’t be a column anymore (can’t imagine why this bahaviour, but…).
df.groupby("classroom", as_index=False).count()
Export as CSV
Withoud a column for indexes
df.to_csv('file_name.csv', index=False)
Apply an operation to all rows and save to a new column
df["new_column"] = df.apply(lambda row: some_function(row["existing_column"]), axis=1)
Note: axis=0 would apply it to all columns