# Basic use

## Ultra-short cheat sheet

``````from datamatrix import DataMatrix, io
# Read a DataMatrix from file ('data.csv' is a placeholder file name)
dm = io.readtxt('data.csv')
# Create a new DataMatrix
dm = DataMatrix(length=5)
# The first two rows
print(dm[:2])
# Create a new column and initialize it with the Fibonacci series
dm.fibonacci = 0, 1, 1, 2, 3
# You can also specify column names as if they are dict keys
dm['fibonacci'] = 0, 1, 1, 2, 3
# Remove 0 and 3 with a simple selection
dm = (dm.fibonacci > 0) & (dm.fibonacci < 3)
# Get a list of indices that match certain criteria
print(dm[(dm.fibonacci > 0) & (dm.fibonacci < 3)])
# Select 1, 1, and 2 by matching any of the values in a set
dm = dm.fibonacci == {1, 2}
# Select all odd numbers with a lambda expression
dm = dm.fibonacci == (lambda x: x % 2)
# Change all 1s to -1
dm.fibonacci[dm.fibonacci == 1] = -1
# The first two cells from the fibonacci column
print(dm.fibonacci[:2])
# Column mean
print(dm.fibonacci[...])
# Multiply all fibonacci cells by 2
dm.fibonacci_times_two = dm.fibonacci * 2
# Loop through all rows
for row in dm:
    print(row.fibonacci) # get the fibonacci cell from the row
# Loop through all columns
for colname, col in dm.columns:
    for cell in col: # Loop through all cells in the column
        print(cell) # do something with the cell
# Or just see which columns exist
print(dm.column_names)
``````

Important note: Because of a limitation (or feature, if you will) of the Python language, the behavior of `and`, `or`, and chained (`x < y < z`) comparisons cannot be modified. These therefore do not work with `DataMatrix` objects as you would expect them to:

``````# INCORRECT: The following does *not* work as expected
dm = dm.fibonacci > 0 and dm.fibonacci < 3
# INCORRECT: The following does *not* work as expected
dm = 0 < dm.fibonacci < 3
# CORRECT: Use the '&' operator
dm = (dm.fibonacci > 0) & (dm.fibonacci < 3)
``````
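Why this happens can be illustrated without `DataMatrix` at all. The following is a minimal sketch using a toy `Col` class (hypothetical, and not how the library is implemented) whose comparisons return sets of matching row indices. It shows that `&` can be customized, whereas `and` and chained comparisons simply return one of their operands.

```python
class Col:
    """Toy stand-in for a column; NOT the DataMatrix implementation."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        # Return the set of row indices whose value is larger than `other`
        return {i for i, v in enumerate(self.values) if v > other}

    def __lt__(self, other):
        # Return the set of row indices whose value is smaller than `other`
        return {i for i, v in enumerate(self.values) if v < other}


col = Col([0, 1, 1, 2, 3])
# `&` is an overloadable operator (here: set intersection)
both = (col > 0) & (col < 3)
# `and` cannot be overloaded: because (col > 0) is non-empty and thus truthy,
# the expression evaluates to (col < 3) and the first criterion is dropped
wrong = (col > 0) and (col < 3)
# A chained comparison is shorthand for `(0 < col) and (col < 3)`
chained = 0 < col < 3
print(sorted(both))     # [1, 2, 3]
print(sorted(wrong))    # [0, 1, 2, 3]
print(sorted(chained))  # [0, 1, 2, 3]
```

In this sketch, only the `&` version keeps both criteria; the other two silently ignore the `> 0` test.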

Slightly longer cheat sheet:

## Creating a DataMatrix

Create a new `DataMatrix` object with a length (number of rows) of 2, and add a column (named `col`). By default, the column is of the `MixedColumn` type, which can store numeric, string, and `None` data.

``````import sys
from datamatrix import DataMatrix, __version__
dm = DataMatrix(length=2)
dm.col = '☺'
print('DataMatrix v{} on Python {}\n'.format(__version__, sys.version))
print(dm)
``````

Output:

``````DataMatrix v1.0.1 on Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]

+---+-----+
| # | col |
+---+-----+
| 0 |  ☺  |
| 1 |  ☺  |
+---+-----+
``````

You can change the length of the `DataMatrix` later on. If you reduce the length, data will be lost. If you increase the length, empty cells (by default containing empty strings) will be added.

``````dm.length = 3
``````

You can read and write files with functions from the `datamatrix.io` module. The main supported file types are `csv` and `xlsx`.

``````from datamatrix import io

dm = DataMatrix(length=3)
dm.col = 1, 2, 3
# Write to disk
io.writetxt(dm, 'my_datamatrix.csv')
io.writexlsx(dm, 'my_datamatrix.xlsx')
# And read it back from disk!
dm = io.readtxt('my_datamatrix.csv')
dm = io.readxlsx('my_datamatrix.xlsx')
``````

Multidimensional columns cannot be saved to `csv` or `xlsx` format but instead need to be saved to a custom binary format.

``````from datamatrix import MultiDimensionalColumn
dm.mdim_col = MultiDimensionalColumn(shape=2)
# Write to disk
io.writebin(dm, 'my_datamatrix.dm')
# And read it back from disk!
dm = io.readbin('my_datamatrix.dm')
``````

## Stacking (vertically concatenating) DataMatrix objects

You can stack two `DataMatrix` objects using the `<<` operator. Matching columns will be combined. (Note that `col2` is empty for rows 0 to 2. This is because `dm` does not have a column named `col2`, so the corresponding cells are filled with empty values.)

``````dm2 = DataMatrix(length=2)
dm2.col = '☺'
dm2.col2 = 10, 20
dm3 = dm << dm2
print(dm3)
``````

Output:

``````+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 |  1  |      |
| 1 |  2  |      |
| 2 |  3  |      |
| 3 |  ☺  |  10  |
| 4 |  ☺  |  20  |
+---+-----+------+
``````

Pro-tip: To stack three or more `DataMatrix` objects, using the `stack()` function from the `operations` module is faster than iteratively using the `<<` operator.

``````from datamatrix import operations as ops
dm4 = ops.stack(dm, dm2, dm3)
``````

## Working with columns

### Referring to columns

You can refer to columns in two ways: as keys in a `dict` or as properties. The two notations are identical for most purposes. The main reason to use a `dict` style is when the name of the column is itself variable. Otherwise, the property style is recommended for clarity.

``````dm['col']  # dict style
dm.col     # property style
``````

### Creating columns

By assigning a value to a non-existing column, a new column is created and initialized to this value.

``````dm.col = 'Another value'
print(dm)
``````

Output:

``````+---+---------------+
| # |      col      |
+---+---------------+
| 0 | Another value |
| 1 | Another value |
| 2 | Another value |
+---+---------------+
``````

### Renaming columns

``````dm.rename('col', 'col2')
print(dm)
``````

Output:

``````+---+---------------+
| # |      col2     |
+---+---------------+
| 0 | Another value |
| 1 | Another value |
| 2 | Another value |
+---+---------------+
``````

### Deleting columns

You can delete a column using the `del` keyword:

``````dm.col = 'x'
del dm.col2
print(dm)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 0 |  x  |
| 1 |  x  |
| 2 |  x  |
+---+-----+
``````

### Column types

There are five column types:

• `MixedColumn` is the default column type. This can contain numbers (`int` and `float`), strings (`str`), and `None` values. This column type is flexible but not very fast because it is (mostly) implemented in pure Python, rather than using `numpy`, which is the basis for the other columns. The default value for empty cells is an empty string.
• `FloatColumn` contains `float` numbers. The default value for empty cells is `NAN`.
• `IntColumn` contains `int` numbers. (This does not include `INF` and `NAN`, which are of type `float` in Python.) The default value for empty cells is 0.
• `MultiDimensionalColumn` contains higher-dimensional `float` arrays. This allows you to mix higher-dimensional data, such as time series or images, with regular one-dimensional data. The default value for empty cells is `NAN`.
• `SeriesColumn` is identical to a two-dimensional `MultiDimensionalColumn`.

When you create a `DataMatrix`, you can indicate a default column type.

``````# Create IntColumns by default
dm = DataMatrix(length=2, default_col_type=int)
dm.i = 1, 2  # This is an IntColumn
``````

You can also explicitly indicate the column type when creating a new column.

``````dm.f = float  # This creates an empty (`NAN`-filled) FloatColumn
dm.i = int    # This creates an empty (0-filled) IntColumn
``````

To create a `MultiDimensionalColumn` you need to import the column type and specify a shape:

``````from datamatrix import MultiDimensionalColumn
dm.mdim_col = MultiDimensionalColumn(shape=(2, 3))
print(dm)
``````

Output:

``````+---+-----+---+-----------------+
| # |  f  | i |     mdim_col    |
+---+-----+---+-----------------+
| 0 | nan | 0 |  [[nan nan nan] |
|   |     |   |  [nan nan nan]] |
| 1 | nan | 0 |  [[nan nan nan] |
|   |     |   |  [nan nan nan]] |
+---+-----+---+-----------------+
``````

You can also specify named dimensions. For example, `('x', 'y')` creates a dimension of size 2 where index 0 can be referred to as 'x' and index 1 can be referred to as 'y':

``````dm.mdim_col = MultiDimensionalColumn(shape=(('x', 'y'), 3))
``````

### Column properties

Basic numerical properties, such as the mean, can be accessed directly. For this purpose, only numerical, non-`NAN` values are taken into account.

``````dm = DataMatrix(length=3)
dm.col = 1, 2, 'not a number'
# Numeric descriptives
print('mean: %s' % dm.col.mean)  #  or dm.col[...]
print('median: %s' % dm.col.median)
print('standard deviation: %s' % dm.col.std)
print('sum: %s' % dm.col.sum)
print('min: %s' % dm.col.min)
print('max: %s' % dm.col.max)
# Other properties
print('unique values: %s' % dm.col.unique)
print('number of unique values: %s' % dm.col.count)
print('column name: %s' % dm.col.name)
``````

Output:

``````mean: 1.5
median: 1.5
standard deviation: 0.7071067811865476
sum: 3.0
min: 1.0
max: 2.0
unique values: [1, 2, 'not a number']
number of unique values: 3
column name: col
``````
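To make explicit what "only numerical, non-`NAN` values are taken into account" means, here is a plain-Python sketch that reproduces the mean and standard deviation shown above. It uses only the standard library and is an illustration, not the library's own code. Note that the standard deviation in the output corresponds to the sample formula (dividing by n - 1).

```python
import math
import statistics

values = [1, 2, 'not a number']
# Keep only numeric values that are not NAN
numeric = [
    v for v in values
    if isinstance(v, (int, float)) and not math.isnan(v)
]
mean = statistics.mean(numeric)
std = statistics.stdev(numeric)  # sample standard deviation (n - 1)
print(mean)  # 1.5
print(std)
```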

The `shape` property indicates the number and sizes of the dimensions of the column. For regular columns, the shape is a tuple containing only the length of the datamatrix (the number of rows). For multidimensional columns, the shape is a tuple containing the length of the datamatrix and the shape of cells as specified through the `shape` keyword.

``````print(dm.col.shape)
dm.mdim_col = MultiDimensionalColumn(shape=(2, 4))
print(dm.mdim_col.shape)
``````

Output:

``````(3,)
(3, 2, 4)
``````

The `loaded` property indicates whether a column is currently stored in memory, or whether it is offloaded to disk. This is mainly relevant for multidimensional columns, which are automatically offloaded to disk when memory runs low.

``````print(dm.mdim_col.loaded)
``````

Output:

``````True
``````

## Assigning

### Assigning by index, multiple indices, or slice

You can assign a single value to one or more cells in various ways.

``````dm = DataMatrix(length=4)
# Create a new column
dm.col = ''
# By index: assign to a single cell (at row 1)
dm.col[1] = ':-)'
# By a tuple (or other iterable) of multiple indices:
# assign to cells at rows 0 and 2
dm.col[0, 2] = ':P'
# By slice: assign from row 2 until the end
dm.col[2:] = ':D'
print(dm)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 0 |  :P |
| 1 | :-) |
| 2 |  :D |
| 3 |  :D |
+---+-----+
``````

You can also assign multiple values at once, provided that the to-be-assigned sequence is of the correct length.

``````# Assign to the full column
dm.col = 1, 2, 3, 4
# Assign to two cells
dm.col[0, 2] = 'a', 'b'
print(dm)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 0 |  a  |
| 1 |  2  |
| 2 |  b  |
| 3 |  4  |
+---+-----+
``````

### Assigning to cells that match a selection criterion

As will be described in more detail later on, comparing a column to a value gives a new `DataMatrix` that contains only the matching rows. This subsetted `DataMatrix` can in turn be used to assign to the matching rows of the original `DataMatrix`. This sounds a bit abstract but is very easy in practice:

``````dm.col[1:] = ':D'
dm.is_happy = 'no'
dm.is_happy[dm.col == ':D'] = 'yes'
print(dm)
``````

Output:

``````+---+-----+----------+
| # | col | is_happy |
+---+-----+----------+
| 0 |  a  |    no    |
| 1 |  :D |   yes    |
| 2 |  :D |   yes    |
| 3 |  :D |   yes    |
+---+-----+----------+
``````

### Assigning to multidimensional columns

Assigning to multidimensional columns works much the same as assigning to regular columns. The main differences are that there are multiple dimensions, and that dimensions can be named.

``````dm = DataMatrix(length=2)
dm.mdim_col = MultiDimensionalColumn(shape=(('x', 'y'), 3))
# Set all values to a single value
dm.mdim_col = 1
# Set all last dimensions to a single array of shape 3
dm.mdim_col = [ 1,  2,  3]
# Set all rows to a single array of shape (2, 3)
dm.mdim_col = [[ 1,  2,  3],
               [ 4,  5,  6]]
# Set the column to an array of shape (2, 2, 3)
dm.mdim_col = [[[ 1,  2,  3],
                [ 4,  5,  6]],
               [[ 7,  8,  9],
                [10, 11, 12]]]
``````

To assign to dimensions by name:

``````dm.mdim_col[:, 'x'] = 1, 2, 3  # identical to assigning to dm.mdim_col[:, 0]
dm.mdim_col[:, 'y'] = 4, 5, 6  # identical to assigning to dm.mdim_col[:, 1]
``````

Pro-tip: When assigning an array-like object to a multidimensional column, the shape of the to-be-assigned array needs to match the final part of the shape of the column. This means that you can assign a (2, 3) array to a (2, 2, 3) column, in which case all rows (the first dimension) are set to the array. However, you cannot assign a (2, 2) array to a (2, 2, 3) column.
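The shape rule from this pro-tip can be written down as a small check. The helper `can_assign()` below is hypothetical (it is not part of DataMatrix) and only sketches the rule that the value's shape must equal the trailing part of the column's shape.

```python
def can_assign(value_shape, column_shape):
    """Return True if value_shape equals the trailing part of column_shape.

    Hypothetical helper for illustration; not a DataMatrix function.
    """
    value_shape = tuple(value_shape)
    column_shape = tuple(column_shape)
    n = len(value_shape)
    if n > len(column_shape):
        return False
    return column_shape[len(column_shape) - n:] == value_shape


print(can_assign((2, 3), (2, 2, 3)))  # True: all rows are set to the array
print(can_assign((3,), (2, 2, 3)))    # True: matches the last dimension
print(can_assign((2, 2), (2, 2, 3)))  # False: does not match the final part
```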

## Accessing

### Accessing by index, multiple indices, or slice

``````dm = DataMatrix(length=4)
# Create a new column
dm.col = 'a', 'b', 'c', 'd'
# By index: select a single cell (at row 1).
print(dm.col[1])
# By a tuple (or other iterable) of multiple indices:
# select cells at rows 0 and 2. This gives a new column.
print(dm.col[0, 2])
# By slice: select from row 2 until the end. This gives a new column.
print(dm.col[2:])
``````

Output:

``````b
col['a', 'c']
col['c', 'd']
``````

### Accessing and averaging (ellipsis averaging) multidimensional columns

Accessing multidimensional columns works much the same as accessing regular columns. The main differences are that there are multiple dimensions, and that dimensions can be named.

``````dm = DataMatrix(length=2)
dm.mdim_col = MultiDimensionalColumn(shape=(('x', 'y'), 3))
dm.mdim_col = [[[ 1,  2,  3],
                [ 4,  5,  6]],
               [[ 7,  8,  9],
                [10, 11, 12]]]
# From all rows, get index 1 (named 'y') from the second dimension and index 2 from the third dimension.
print(dm.mdim_col[:, 'y', 2])
``````

Output:

``````col[ 6. 12.]
``````

You can select the average of a column using the ellipsis (`...`) index. For regular columns, this is identical to accessing the `mean` property:

``````dm.col = 1, 2
print(dm.col[...])  # identical to `dm.col.mean`
``````

Output:

``````1.5
``````

Ellipsis averaging (`...`) is especially useful when working with multidimensional data, in which case it allows you to average over specific dimensions. As long as you don't average over the first dimension, which corresponds to the rows of the `DataMatrix`, the result is a new column.

``````# Averaging over the third dimension gives a column of shape (2, 2)
dm.avg3 = dm.mdim_col[:, :, ...]
# Averaging over the second dimension gives a column of shape (2, 3)
dm.avg2 = dm.mdim_col[:, ...]
# Averaging over the second and third dimensions gives a `FloatColumn`.
dm.avg23 = dm.mdim_col[:, ..., ...]
print(dm)
``````

Output:

``````+---+------------------+-------+-----------+-----+-----------------+
| # |       avg2       | avg23 |    avg3   | col |     mdim_col    |
+---+------------------+-------+-----------+-----+-----------------+
| 0 |  [2.5 3.5 4.5]   |  3.5  |  [2. 5.]  |  1  |   [[1. 2. 3.]   |
|   |                  |       |           |     |    [4. 5. 6.]]  |
| 1 | [ 8.5  9.5 10.5] |  9.5  | [ 8. 11.] |  2  |  [[ 7.  8.  9.] |
|   |                  |       |           |     |  [10. 11. 12.]] |
+---+------------------+-------+-----------+-----+-----------------+
``````

When averaging over the first dimension, which corresponds to the rows of the `DataMatrix`, the result is either an array or (if all dimensions are averaged) a float:

``````# Averaging over the rows gives an array of shape (2, 3)
print(dm.mdim_col[...])
# Averaging over all dimensions gives a float
print(dm.mdim_col[..., ..., ...])
``````

Output:

``````[[4. 5. 6.]
[7. 8. 9.]]
6.5
``````
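To see which numbers go into each of these averages, the same computations can be written with nested lists and the standard library. This is only a sketch of the arithmetic, using the example data from above; it says nothing about how DataMatrix computes the averages internally.

```python
from statistics import mean

# Same data as above: 2 rows, each cell has shape (2, 3)
data = [[[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [10, 11, 12]]]

# Average over the third dimension: one value per row and second-dimension index
avg3 = [[mean(vec) for vec in cell] for cell in data]
# Average over the second dimension: one value per row and third-dimension index
avg2 = [[mean(pair) for pair in zip(*cell)] for cell in data]
# Average over the rows: one value per position within a cell
avg_rows = [[mean(vals) for vals in zip(*vecs)] for vecs in zip(*data)]
# Average over all dimensions: a single number
avg_all = mean(v for cell in data for vec in cell for v in vec)

print(avg3)      # [[2, 5], [8, 11]]
print(avg2)      # [[2.5, 3.5, 4.5], [8.5, 9.5, 10.5]]
print(avg_rows)  # [[4, 5, 6], [7, 8, 9]]
print(avg_all)   # 6.5
```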

## Selecting

### Selecting by column values

You can select by directly comparing columns to values. This returns a new `DataMatrix` object with only the selected rows.

``````dm = DataMatrix(length=10)
dm.col = range(10)
dm_subset = dm.col > 5
print(dm_subset)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 6 |  6  |
| 7 |  7  |
| 8 |  8  |
| 9 |  9  |
+---+-----+
``````

### Selecting by multiple criteria with `|` (or), `&` (and), and `^` (xor)

You can select by multiple criteria using the `|` (or), `&` (and), and `^` (xor) operators (but not the actual words 'and' and 'or'). Note the parentheses, which are necessary because `|`, `&`, and `^` have a higher precedence than comparison operators such as `<` and `>`.

``````dm_subset = (dm.col < 1) | (dm.col > 8)
print(dm_subset)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 0 |  0  |
| 9 |  9  |
+---+-----+
``````

``````dm_subset = (dm.col > 1) & (dm.col < 8)
print(dm_subset)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 2 |  2  |
| 3 |  3  |
| 4 |  4  |
| 5 |  5  |
| 6 |  6  |
| 7 |  7  |
+---+-----+
``````
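The need for parentheses follows from Python's operator precedence and can be checked with plain integers. The sketch below uses the arbitrary example value `x = 9`; the same parsing rules apply to column comparisons.

```python
x = 9
# With parentheses: both comparisons are evaluated first, then combined
with_parens = (x > 1) & (x < 8)
# Without parentheses, `&` binds first, so this is parsed as
# x > (1 & x) < 8, which is a chained comparison on a different value
without_parens = x > 1 & x < 8
print(with_parens)     # False
print(without_parens)  # True
```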

### Selecting by multiple criteria by comparing to a set `{}`

If you want to check whether column values are identical to, or different from, a set of test values, you can compare the column to a `set` object. (This is considerably faster than comparing the column values to each of the test values separately, and then merging the result using `&` or `|`.)

``````dm_subset = dm.col == {1, 3, 5, 7}
print(dm_subset)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 1 |  1  |
| 3 |  3  |
| 5 |  5  |
| 7 |  7  |
+---+-----+
``````

### Selecting (filtering) with a function or lambda expression

You can also use a function or `lambda` expression to select column values. The function must take a single argument and its return value determines whether the column value is selected. This is analogous to the classic `filter()` function.

``````dm_subset = dm.col == (lambda x: x % 2)
print(dm_subset)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 1 |  1  |
| 3 |  3  |
| 5 |  5  |
| 7 |  7  |
| 9 |  9  |
+---+-----+
``````
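For comparison, this is the plain-Python `filter()` call that the selection above is analogous to. It is a sketch on an ordinary list, so it returns the matching values rather than a `DataMatrix` with the matching rows.

```python
values = list(range(10))
# Keep values for which the function returns a truthy result (odd numbers)
odd = list(filter(lambda x: x % 2, values))
print(odd)  # [1, 3, 5, 7, 9]
```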

### Selecting values that match another column (or sequence)

You can also select by comparing a column to a sequence, in which case a row-by-row comparison is done. This requires that the sequence has the same length as the column and is not a `set` object (because `set` objects are treated as described above).

``````dm = DataMatrix(length=4)
dm.col = 'a', 'b', 'c', 'd'
dm_subset = dm.col == ['a', 'b', 'x', 'y']
print(dm_subset)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 0 |  a  |
| 1 |  b  |
+---+-----+
``````

### Selecting values by type

When a column contains values of different types, you can also select values by type:

``````dm = DataMatrix(length=4)
dm.col = 'a', 1, 'c', 2
dm_subset = dm.col == int
print(dm_subset)
``````

Output:

``````+---+-----+
| # | col |
+---+-----+
| 1 |  1  |
| 3 |  2  |
+---+-----+
``````

### Getting indices for rows that match selection criteria ('where')

You can get the indices for rows that match certain selection criteria by slicing a `DataMatrix` with a subset of itself. This is similar to the `numpy.where()` function.

``````dm = DataMatrix(length=4)
dm.col = 1, 2, 3, 4
indices = dm[(dm.col > 1) & (dm.col < 4)]
print(indices)
``````

Output:

``````[1, 2]
``````
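The result is comparable to what you would get from a list comprehension over an ordinary Python list. The sketch below is an analogy only; note that `and` works here because it is applied to individual numbers, not to columns.

```python
values = [1, 2, 3, 4]
# Collect the positions of all values that satisfy both criteria
indices = [i for i, v in enumerate(values) if v > 1 and v < 4]
print(indices)  # [1, 2]
```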

### Selecting a subset of columns

You can select a subset of columns by passing the columns as an index to `dm[]`. Columns can be specified by name (`'col3'`) or by object (`dm.col1`).

``````dm = DataMatrix(length=4)
dm.col1 = '☺'
dm.col2 = 'a'
dm.col3 = 1
dm_subset = dm[dm.col1, 'col3']
print(dm_subset)
``````

Output:

``````+---+------+------+
| # | col1 | col3 |
+---+------+------+
| 0 |  ☺   |  1   |
| 1 |  ☺   |  1   |
| 2 |  ☺   |  1   |
| 3 |  ☺   |  1   |
+---+------+------+
``````

## Element-wise column operations

You can apply basic mathematical operations on all cells in a column simultaneously. Cells with non-numeric values are ignored, except by the `+` operator, which then results in concatenation.

``````dm = DataMatrix(length=3)
dm.col = 0, 'a', 20
dm.col2 = dm.col * .5
dm.col3 = dm.col + 10
dm.col4 = dm.col - 10
dm.col5 = dm.col / 50
print(dm)
``````

Output:

``````+---+-----+------+------+------+------+
| # | col | col2 | col3 | col4 | col5 |
+---+-----+------+------+------+------+
| 0 |  0  | 0.0  |  10  | -10  | 0.0  |
| 1 |  a  |  a   | a10  |  a   |  a   |
| 2 |  20 | 10.0 |  30  |  10  | 0.4  |
+---+-----+------+------+------+------+
``````

### Applying (mapping) a function or lambda expression

You can apply a function or `lambda` expression to all cells in a column simultaneously with the `@` operator. This is analogous to the classic `map()` function.

``````dm = DataMatrix(length=3)
dm.col = 0, 1, 2
dm.col2 = dm.col @ (lambda x: x*2)
print(dm)
``````

Output:

``````+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 |  0  |  0   |
| 1 |  1  |  2   |
| 2 |  2  |  4   |
+---+-----+------+
``````
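For comparison, this is the plain-Python `map()` call that the `@` operator is analogous to, applied to an ordinary list rather than a column.

```python
values = [0, 1, 2]
# Apply the function to every value
doubled = list(map(lambda x: x * 2, values))
print(doubled)  # [0, 2, 4]
```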

## Iterating over rows, columns, and cells (for loops)

By iterating directly over a `DataMatrix` object, you get successive `Row` objects. From a `Row` object, you can directly access cells.

``````dm.col = 'a', 'b', 'c'
for row in dm:
    print(row)
    print(row.col)
``````

Output:

``````+------+-------+
| Name | Value |
+------+-------+
| col  |   a   |
| col2 |   0   |
+------+-------+
a
+------+-------+
| Name | Value |
+------+-------+
| col  |   b   |
| col2 |   2   |
+------+-------+
b
+------+-------+
| Name | Value |
+------+-------+
| col  |   c   |
| col2 |   4   |
+------+-------+
c
``````

By iterating over `DataMatrix.columns`, you get successive `(column_name, column)` tuples.

``````for colname, col in dm.columns:
    print('%s = %s' % (colname, col))
``````

Output:

``````col = col['a', 'b', 'c']
col2 = col[0, 2, 4]
``````

By iterating over a column, you get successive cells:

``````for cell in dm.col:
    print(cell)
``````

Output:

``````a
b
c
``````

By iterating over a `Row` object, you get `(column_name, cell)` tuples:

``````row = dm[0]  # Get the first row
for colname, cell in row:
    print('%s = %s' % (colname, cell))
``````

Output:

``````col = a
col2 = 0
``````

The `column_names` property gives a sorted list of all column names (without the corresponding column objects):

``````print(dm.column_names)
``````

Output:

``````['col', 'col2']
``````

## Miscellaneous notes

### Type conversion and character encoding

For `MixedColumn`:

• The strings 'nan', 'inf', and '-inf' are converted to the corresponding `float` values (`NAN`, `INF`, and `-INF`).
• Byte-string values (`bytes`) are automatically converted to `str` assuming `utf-8` encoding.
• Trying to assign an unsupported type results in a `TypeError`.
• The string 'None' is not converted to the type `None`.

For `FloatColumn`:

• The strings 'nan', 'inf', and '-inf' are converted to the corresponding `float` values (`NAN`, `INF`, and `-INF`).
• Unsupported types are converted to `NAN`. A warning is shown.

For `IntColumn`:

• Trying to assign non-`int` values results in a `TypeError`.
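The `FloatColumn` rules listed above can be approximated in a few lines of plain Python. The function `to_float_cell()` below is a hypothetical helper written for illustration; it is not part of DataMatrix, and the library's actual conversion logic may differ in its details (for example in how it treats numeric strings).

```python
import math
import warnings


def to_float_cell(value):
    """Hypothetical sketch of the FloatColumn conversion rules."""
    try:
        # float() already understands 'nan', 'inf', and '-inf'
        return float(value)
    except (TypeError, ValueError):
        # Unsupported values become NAN, and a warning is shown
        warnings.warn('cannot convert %r to float, using NAN' % (value,))
        return math.nan


print(to_float_cell('inf'))   # inf
print(to_float_cell('-inf'))  # -inf
print(to_float_cell('nan'))   # nan
```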

### NAN and INF values

You have to take special care when working with `nan` data. In general, `nan` is not equal to anything else, not even to itself: `nan != nan`. You can see this behavior when selecting data from a `FloatColumn` with `nan` values in it.

``````from datamatrix import DataMatrix, FloatColumn, NAN
dm = DataMatrix(length=3)
dm.f = FloatColumn
dm.f = 0, NAN, 1
dm = dm.f == [0, NAN, 1]
print(dm)
``````

Output:

``````+---+-----+
| # |  f  |
+---+-----+
| 0 | 0.0 |
| 2 | 1.0 |
+---+-----+
``````

However, for convenience, you can select all `nan` values by comparing a `FloatColumn` to a single `nan` value:

``````dm = DataMatrix(length=3)
dm.f = FloatColumn
dm.f = 0, NAN, 1
print('NaN values')
print(dm.f == NAN)
print('Non-NaN values')
print(dm.f != NAN)
``````

Output:

``````NaN values
+---+-----+
| # |  f  |
+---+-----+
| 1 | nan |
+---+-----+
Non-NaN values
+---+-----+
| # |  f  |
+---+-----+
| 0 | 0.0 |
| 2 | 1.0 |
+---+-----+
``````
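The underlying `nan` behavior is a property of floating-point numbers in Python itself, independent of DataMatrix. The following sketch shows it with the standard library, including how `math.isnan()` can be used to find `nan` values in an ordinary list.

```python
import math

nan = float('nan')
print(nan == nan)       # False
print(nan != nan)       # True
print(math.isnan(nan))  # True

values = [0.0, nan, 1.0]
# An equality test never matches nan, so use math.isnan() instead
nan_rows = [i for i, v in enumerate(values) if math.isnan(v)]
print(nan_rows)  # [1]
```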