Basic use
- Ultra-short cheat sheet
- Creating a DataMatrix
- Reading and writing files
- Stacking (vertically concatenating) DataMatrix objects
- Working with columns
- Assigning
- Accessing
- Selecting
- Selecting by column values
- Selecting by multiple criteria with | (or), & (and), and ^ (xor)
- Selecting by multiple criteria by comparing to a set {}
- Selecting (filtering) with a function or lambda expression
- Selecting values that match another column (or sequence)
- Selecting values by type
- Getting indices for rows that match selection criteria ('where')
- Selecting a subset of columns
- Element-wise column operations
- Iterating over rows, columns, and cells (for loops)
- Miscellanous notes
Ultra-short cheat sheet
from datamatrix import DataMatrix, io
# Read a DataMatrix from file
dm = io.readtxt('data.csv')
# Create a new DataMatrix
dm = DataMatrix(length=5)
# The first two rows
print(dm[:2])
# Create a new column and initialize it with the Fibonacci series
dm.fibonacci = 0, 1, 1, 2, 3
# You can also specify column names as if they are dict keys
dm['fibonacci'] = 0, 1, 1, 2, 3
# Remove 0 and 3 with a simple selection
dm = (dm.fibonacci > 0) & (dm.fibonacci < 3)
# Get a list of indices that match certain criteria
print(dm[(dm.fibonacci > 0) & (dm.fibonacci < 3)])
# Select 1, 1, and 2 by matching any of the values in a set
dm = dm.fibonacci == {1, 2}
# Select all odd numbers with a lambda expression
dm = dm.fibonacci == (lambda x: x % 2)
# Change all 1s to -1
dm.fibonacci[dm.fibonacci == 1] = -1
# The first two cells from the fibonacci column
print(dm.fibonacci[:2])
# Column mean
print(dm.fibonacci[...])
# Multiply all fibonacci cells by 2
dm.fibonacci_times_two = dm.fibonacci * 2
# Loop through all rows
for row in dm:
print(row.fibonacci) # get the fibonacci cell from the row
# Loop through all columns
for colname, col in dm.columns:
for cell in col: # Loop through all cells in the column
print(cell) # do something with the cell
# Or just see which columns exist
print(dm.column_names)
Important note: Because of a limitation (or feature, if you will) of the Python language, the behavior of and
, or
, and chained (x < y < z
) comparisons cannot be modified. These therefore do not work with DataMatrix
objects as you would expect them to:
# INCORRECT: The following does *not* work as expected
dm = dm.fibonacci > 0 and dm.fibonacci < 3
# INCORRECT: The following does *not* work as expected
dm = 0 < dm.fibonacci < 3
# CORRECT: Use the '&' operator
dm = (dm.fibonacci > 0) & (dm.fibonacci < 3)
Slightly longer cheat sheet:
Creating a DataMatrix
Create a new DataMatrix
object with a length (number of rows) of 2, and add a column (named col
). By default, the column is of the MixedColumn
type, which can store numeric, string, and None
data.
import sys
from datamatrix import DataMatrix, __version__
dm = DataMatrix(length=2)
dm.col = '☺'
print('DataMatrix v{} on Python {}\n'.format(__version__, sys.version))
print(dm)
Output:
DataMatrix v1.0.3 on Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]
+---+-----+
| # | col |
+---+-----+
| 0 | ☺ |
| 1 | ☺ |
+---+-----+
You can change the length of the DataMatrix
later on. If you reduce the length, data will be lost. If you increase the length, empty cells (by default containing empty strings) will be added.
dm.length = 3
Reading and writing files
You can read and write files with functions from the datamatrix.io
module. The main supported file types are csv
and xlsx
.
from datamatrix import io
dm = DataMatrix(length=3)
dm.col = 1, 2, 3
# Write to disk
io.writetxt(dm, 'my_datamatrix.csv')
io.writexlsx(dm, 'my_datamatrix.xlsx')
# And read it back from disk!
dm = io.readtxt('my_datamatrix.csv')
dm = io.readxlsx('my_datamatrix.xlsx')
Multidimensional columns cannot be saved to csv
or xlsx
format but instead need to be saved to a custom binary format.
from datamatrix import MultiDimensionalColumn
dm.mdim_col = MultiDimensionalColumn(shape=2)
# Write to disk
io.writebin(dm, 'my_datamatrix.dm')
# And read it back from disk!
dm = io.readbin('my_datamatrix.dm')
Stacking (vertically concatenating) DataMatrix objects
You can stack two DataMatrix
objects using the <<
operator. Matching columns will be combined. (Note that row 2 is empty. This is because we have increased the length of dm
in the previous step, causing an empty row to be added.)
dm2 = DataMatrix(length=2)
dm2.col = '☺'
dm2.col2 = 10, 20
dm3 = dm << dm2
print(dm3)
Output:
+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 | 1 | |
| 1 | 2 | |
| 2 | 3 | |
| 3 | ☺ | 10 |
| 4 | ☺ | 20 |
+---+-----+------+
Pro-tip: To stack three or more DataMatrix
objects, using the stack()
function from the operations
module is faster than iteratively using the <<
operator.
from datamatrix import operations as ops
dm4 = ops.stack(dm, dm2, dm3)
Working with columns
Referring to columns
You can refer to columns in two ways: as keys in a dict
or as properties. The two notations are identical for most purposes. The main reason to use a dict
style is when the name of the column is itself variable. Otherwise, the property style is recommended for clarity.
dm['col'] # dict style
dm.col # property style
Creating columns
By assigning a value to a non-existing colum, a new column is created and initialized to this value.
dm.col = 'Another value'
print(dm)
Output:
+---+---------------+
| # | col |
+---+---------------+
| 0 | Another value |
| 1 | Another value |
| 2 | Another value |
+---+---------------+
Renaming columns
dm.rename('col', 'col2')
print(dm)
Output:
+---+---------------+
| # | col2 |
+---+---------------+
| 0 | Another value |
| 1 | Another value |
| 2 | Another value |
+---+---------------+
Deleting columns
You can delete a column using the del
keyword:
dm.col = 'x'
del dm.col2
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | x |
| 1 | x |
| 2 | x |
+---+-----+
Column types
There are five column types:
MixedColumn
is the default column type. This can contain numbers (int
andfloat
), strings (str
), andNone
values. This column type is flexible but not very fast because it is (mostly) implemented in pure Python, rather than usingnumpy
, which is the basis for the other columns. The default value for empty cells is an empty string.FloatColumn
containsfloat
numbers. The default value for empty cells isNAN
.IntColumn
containsint
numbers. (This does not includeINF
, andNAN
, which are of typefloat
in Python.) The default value for empty cells is 0.MultiDimensionalColumn
contains higher-dimensionalfloat
arrays. This allows you to mix higher-dimensional data, such as time series or images, with regular one-dimensional data. The default value for empty cells isNAN
.SeriesColumn
is identical to a two-dimensionalMultiDimensionalColumn
.
When you create a DataMatrix
, you can indicate a default column type.
# Create IntColumns by default
dm = DataMatrix(length=2, default_col_type=int)
dm.i = 1, 2 # This is an IntColumn
You can also explicitly indicate the column type when creating a new column.
dm.f = float # This creates an empty (`NAN`-filled) FloatColumn
dm.i = int # This creates an empty (0-filled) IntColumn
To create a MultiDimensionalColumn
you need to import the column type and specify a shape:
from datamatrix import MultiDimensionalColumn
dm.mdim_col = MultiDimensionalColumn(shape=(2, 3))
print(dm)
Output:
+---+-----+---+-----------------+
| # | f | i | mdim_col |
+---+-----+---+-----------------+
| 0 | nan | 0 | [[nan nan nan] |
| | | | [nan nan nan]] |
| 1 | nan | 0 | [[nan nan nan] |
| | | | [nan nan nan]] |
+---+-----+---+-----------------+
You can also specify named dimensions. For example, ('x', 'y')
creates a dimension of size 2 where index 0 can be referred to as 'x' and index 1 can be referred to as 'y':
dm.mdim_col = MultiDimensionalColumn(shape=(('x', 'y'), 3))
Column properties
Basic numerical properties, such as the mean, can be accessed directly. For this purpose, only numerical, non-NAN
values are taken into account.
dm = DataMatrix(length=3)
dm.col = 1, 2, 'not a number'
# Numeric descriptives
print('mean: %s' % dm.col.mean) # or dm.col[...]
print('median: %s' % dm.col.median)
print('standard deviation: %s' % dm.col.std)
print('sum: %s' % dm.col.sum)
print('min: %s' % dm.col.min)
print('max: %s' % dm.col.max)
# Other properties
print('unique values: %s' % dm.col.unique)
print('number of unique values: %s' % dm.col.count)
print('column name: %s' % dm.col.name)
Output:
mean: 1.5
median: 1.5
standard deviation: 0.7071067811865476
sum: 3.0
min: 1.0
max: 2.0
unique values: [1, 2, 'not a number']
number of unique values: 3
column name: col
The shape
property indicates the number and sizes of the dimensions of the column. For regular columns, the shape is a tuple containing only the length of the datamatrix (the number of rows). For multidimensional columns, the shape is a tuple containing the length of the datamatrix and the shape of cells as specified through the shape
keyword.
print(dm.col.shape)
dm.mdim_col = MultiDimensionalColumn(shape=(2, 4))
print(dm.mdim_col.shape)
Output:
(3,)
(3, 2, 4)
The loaded
property indicates whether a column is currently stored in memory, or whether it is offloaded to disk. This is mainly relevant for multidimensional columns, which are automatically offloaded to disk when memory runs low.
print(dm.mdim_col.loaded)
Output:
True
Assigning
Assigning by index, multiple indices, or slice
You can assign a single value to one or more cells in various ways.
dm = DataMatrix(length=4)
# Create a new columm
dm.col = ''
# By index: assign to a single cell (at row 1)
dm.col[1] = ':-)'
# By a tuple (or other iterable) of multiple indices:
# assign to cells at rows 0 and 2
dm.col[0, 2] = ':P'
# By slice: assign from row 1 until the end
dm.col[2:] = ':D'
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | :P |
| 1 | :-) |
| 2 | :D |
| 3 | :D |
+---+-----+
You can also assign multiple values at once, provided that the to-be-assigned sequence is of the correct length.
# Assign to the full column
dm.col = 1, 2, 3, 4
# Assign to two cells
dm.col[0, 2] = 'a', 'b'
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | a |
| 1 | 2 |
| 2 | b |
| 3 | 4 |
+---+-----+
Assigning to cells that match a selection criterion
As will be described in more detail later on, comparing a column to a value gives a new DataMatrix
that contains only the matching rows. This subsetted DataMatrix
can in turn be used to assign to the matching rows of the original DataMatrix
. This sounds a bit abstract but is very easy in practice:
dm.col[1:] = ':D'
dm.is_happy = 'no'
dm.is_happy[dm.col == ':D'] = 'yes'
print(dm)
Output:
+---+-----+----------+
| # | col | is_happy |
+---+-----+----------+
| 0 | a | no |
| 1 | :D | yes |
| 2 | :D | yes |
| 3 | :D | yes |
+---+-----+----------+
Assigning to multidimensional columns
Assigning to multidimensional columns works much the same as assigning to regular columns. The main differences are that there are multiple dimensions, and that dimensions can be named.
dm = DataMatrix(length=2)
dm.mdim_col = MultiDimensionalColumn(shape=(('x', 'y'), 3))
# Set all values to a single value
dm.mdim_col = 1
# Set all last dimensions to a single array of shape 3
dm.mdim_col = [ 1, 2, 3]
# Set all rows to a single array of shape (2, 3)
dm.mdim_col = [[ 1, 2, 3],
[ 4, 5, 6]]
# Set the column to an array of shape (2, 3, 3)
dm.mdim_col = [[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]]
To assign to dimensions by name:
dm.mdim_col[:, 'x'] = 1, 2, 3 # identical to assigning to dm.mdim_col[:, 0]
dm.mdim_col[:, 'y'] = 4, 5, 6 # identical to assigning to dm.mdim_col[:, 1]
Pro-tip: When assigning an array-like object to a multidimensional column, the shape of the to-be-assigned array needs to match the final part of the shape of the column. This means that you can assign a (2, 3) array to a (2, 2, 3) column in which case all rows (the first dimension) are set to the array. shape However, you cannot assign a (2, 2) array to a (2, 2, 3) column.
Accessing
Accessing by index, multiple indices, or slice
dm = DataMatrix(length=4)
# Create a new column
dm.col = 'a', 'b', 'c', 'd'
# By index: select a single cell (at row 1).
print(dm.col[1])
# By a tuple (or other iterable) of multiple indices:
# select cells at rows 0 and 2. This gives a new column.
print(dm.col[0, 2])
# By slice: assign from row 1 until the end. This gives a new column.
print(dm.col[2:])
Output:
b
col['a', 'c']
col['c', 'd']
Accessing and averaging (ellipsis averaging) multidimensional columns
Accessing multidimensional columns works much the same as accessing regular columns. The main differences are that there are multiple dimensions, and that dimensions can be named.
dm = DataMatrix(length=2)
dm.mdim_col = MultiDimensionalColumn(shape=(('x', 'y'), 3))
dm.mdim_col = [[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]]
# From all rows, get index 1 (named 'y') from the second dimension and index 2 from the third dimension.
print(dm.mdim_col[:, 'y', 2])
Output:
col[ 6. 12.]
You can select the average of a column using the ellipsis (...
) index. For regular columns, this is indentical to accessing the mean
property:
dm.col = 1, 2
print(dm.col[...]) # identical to `dm.col.mean`
Output:
1.5
Ellipsis averaging (...
) is especially useful when working with multidimensional data, in which case it allows you to average over specific dimensions. As long as you don't average over the first dimension, which corresponds to the rows of the DataMatrix
, the result is a new column.
# Averaging over the third dimension gives a column of shape (2, 2)
dm.avg3 = dm.mdim_col[:, :, ...]
# Average over the second dimension gives a colum of shape (2, 3)
dm.avg2 = dm.mdim_col[:, ...]
# Averaging over the second and third dimensions gives a `FloatColumn`.
dm.avg23 = dm.mdim_col[:, ..., ...]
print(dm)
Output:
+---+------------------+-------+-----------+-----+-----------------+
| # | avg2 | avg23 | avg3 | col | mdim_col |
+---+------------------+-------+-----------+-----+-----------------+
| 0 | [2.5 3.5 4.5] | 3.5 | [2. 5.] | 1 | [[1. 2. 3.] |
| | | | | | [4. 5. 6.]] |
| 1 | [ 8.5 9.5 10.5] | 9.5 | [ 8. 11.] | 2 | [[ 7. 8. 9.] |
| | | | | | [10. 11. 12.]] |
+---+------------------+-------+-----------+-----+-----------------+
When averaging over the first dimension, which corresponds to the rows of the DataMatrix
, the result is either an array or (if all dimensions are averaged) a float:
# Averaging over the rows gives an array of shape (2, 3)
print(dm.mdim_col[...])
# Averaging over all dimensions gives a float
print(dm.mdim_col[..., ..., ...])
Output:
[[4. 5. 6.]
[7. 8. 9.]]
6.5
Selecting
Selecting by column values
You can select by directly comparing columns to values. This returns a new DataMatrix
object with only the selected rows.
dm = DataMatrix(length=10)
dm.col = range(10)
dm_subset = dm.col > 5
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
+---+-----+
Selecting by multiple criteria with |
(or), &
(and), and ^
(xor)
You can select by multiple criteria using the |
(or), &
(and), and ^
(xor) operators (but not the actual words 'and' and 'or'). Note the parentheses, which are necessary because |
, &
, and ^
have priority over other operators.
dm_subset = (dm.col < 1) | (dm.col > 8)
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | 0 |
| 9 | 9 |
+---+-----+
dm_subset = (dm.col > 1) & (dm.col < 8)
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
+---+-----+
Selecting by multiple criteria by comparing to a set {}
If you want to check whether column values are identical to, or different from, a set of test values, you can compare the column to a set
object. (This is considerably faster than comparing the column values to each of the test values separately, and then merging the result using &
or |
.)
dm_subset = dm.col == {1, 3, 5, 7}
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 1 | 1 |
| 3 | 3 |
| 5 | 5 |
| 7 | 7 |
+---+-----+
Selecting (filtering) with a function or lambda expression
You can also use a function or lambda
expression to select column values. The function must take a single argument and its return value determines whether the column value is selected. This is analogous to the classic filter()
function.
dm_subset = dm.col == (lambda x: x % 2)
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 1 | 1 |
| 3 | 3 |
| 5 | 5 |
| 7 | 7 |
| 9 | 9 |
+---+-----+
Selecting values that match another column (or sequence)
You can also select by comparing a column to a sequence, in which case a row-by-row comparison is done. This requires that the sequence has the same length as the column, is not a set
object (because set
objects are treated as described above).
dm = DataMatrix(length=4)
dm.col = 'a', 'b', 'c', 'd'
dm_subset = dm.col == ['a', 'b', 'x', 'y']
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | a |
| 1 | b |
+---+-----+
Selecting values by type
When a column contains values of different types, you can also select values by type:
dm = DataMatrix(length=4)
dm.col = 'a', 1, 'c', 2
dm_subset = dm.col == int
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 1 | 1 |
| 3 | 2 |
+---+-----+
Getting indices for rows that match selection criteria ('where')
You can get the indices for rows that match certain selection criteria by slicing a DataMatrix
with a subset of itself. This is similar to the numpy.where()
function.
dm = DataMatrix(length=4)
dm.col = 1, 2, 3, 4
indices = dm[(dm.col > 1) & (dm.col < 4)]
print(indices)
Output:
[1, 2]
Selecting a subset of columns
You can select a subset of columns by passing the columns as an index to dm[]
. Columns can be specified by name ('col3') or by object (dm.col1
).
dm = DataMatrix(length=4)
dm.col1 = '☺'
dm.col2 = 'a'
dm.col3 = 1
dm_subset = dm[dm.col1, 'col3']
print(dm_subset)
Output:
+---+------+------+
| # | col1 | col3 |
+---+------+------+
| 0 | ☺ | 1 |
| 1 | ☺ | 1 |
| 2 | ☺ | 1 |
| 3 | ☺ | 1 |
+---+------+------+
Element-wise column operations
Multiplication, addition, etc.
You can apply basic mathematical operations on all cells in a column simultaneously. Cells with non-numeric values are ignored, except by the +
operator, which then results in concatenation.
dm = DataMatrix(length=3)
dm.col = 0, 'a', 20
dm.col2 = dm.col * .5
dm.col3 = dm.col + 10
dm.col4 = dm.col - 10
dm.col5 = dm.col / 50
print(dm)
Output:
+---+-----+------+------+------+------+
| # | col | col2 | col3 | col4 | col5 |
+---+-----+------+------+------+------+
| 0 | 0 | 0.0 | 10 | -10 | 0.0 |
| 1 | a | a | a10 | a | a |
| 2 | 20 | 10.0 | 30 | 10 | 0.4 |
+---+-----+------+------+------+------+
Applying (mapping) a function or lambda expression
You can apply a function or lambda
expression to all cells in a column simultaneously with the @
operator. This analogous to the classic map()
function.
dm = DataMatrix(length=3)
dm.col = 0, 1, 2
dm.col2 = dm.col @ (lambda x: x*2)
print(dm)
Output:
+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |
+---+-----+------+
Iterating over rows, columns, and cells (for loops)
By iterating directly over a DataMatrix
object, you get successive Row
objects. From a Row
object, you can directly access cells.
dm.col = 'a', 'b', 'c'
for row in dm:
print(row)
print(row.col)
Output:
+------+-------+
| Name | Value |
+------+-------+
| col | a |
| col2 | 0 |
+------+-------+
a
+------+-------+
| Name | Value |
+------+-------+
| col | b |
| col2 | 2 |
+------+-------+
b
+------+-------+
| Name | Value |
+------+-------+
| col | c |
| col2 | 4 |
+------+-------+
c
By iterating over DataMatrix.columns
, you get successive (column_name, column)
tuples.
for colname, col in dm.columns:
print('%s = %s' % (colname, col))
Output:
col = col['a', 'b', 'c']
col2 = col[0, 2, 4]
By iterating over a column, you get successive cells:
for cell in dm.col:
print(cell)
Output:
a
b
c
By iterating over a Row
object, you get (column_name, cell
) tuples:
row = dm[0] # Get the first row
for colname, cell in row:
print('%s = %s' % (colname, cell))
Output:
col = a
col2 = 0
The column_names
property gives a sorted list of all column names (without the corresponding column objects):
print(dm.column_names)
Output:
['col', 'col2']
Miscellanous notes
Type conversion and character encoding
For MixedColumn
:
- The strings 'nan', 'inf', and '-inf' are converted to the corresponding
float
values (NAN
,INF
, and-INF
). - Byte-string values (
bytes
) are automatically converted tostr
assumingutf-8
encoding. - Trying to assign an unsupported type results in a
TypeError
. - The string 'None' is not converted to the type
None
.
For FloatColumn
:
- The strings 'nan', 'inf', and '-inf' are converted to the corresponding
float
values (NAN
,INF
, and-INF
). - Unsupported types are converted to
NAN
. A warning is shown.
For IntColumn
:
- Trying to assign non-
int
values results in aTypeError
.
NAN and INF values
You have to take special care when working with nan
data. In general, nan
is not equal to anything else, not even to itself: nan != nan
. You can see this behavior when selecting data from a FloatColumn
with nan
values in it.
from datamatrix import DataMatrix, FloatColumn, NAN
dm = DataMatrix(length=3)
dm.f = FloatColumn
dm.f = 0, NAN, 1
dm = dm.f == [0, NAN, 1]
print(dm)
Output:
+---+-----+
| # | f |
+---+-----+
| 0 | 0.0 |
| 2 | 1.0 |
+---+-----+
However, for convenience, you can select all nan
values by comparing a FloatColumn
to a single nan
value:
dm = DataMatrix(length=3)
dm.f = FloatColumn
dm.f = 0, NAN, 1
print(dm.f == NAN)
print('NaN values')
print('Non-NaN values')
print(dm.f != NAN)
Output:
+---+-----+
| # | f |
+---+-----+
| 1 | nan |
+---+-----+
NaN values
Non-NaN values
+---+-----+
| # | f |
+---+-----+
| 0 | 0.0 |
| 2 | 1.0 |
+---+-----+