Basic use
Ultra-short cheat sheet:
from datamatrix import DataMatrix, io
# Read a DataMatrix from file
dm = io.readtxt('data.csv')
# Create a new DataMatrix
dm = DataMatrix(length=5)
# The first two rows
print(dm[:2])
# Create a new column and initialize it with the Fibonacci series
dm.fibonacci = 0, 1, 1, 2, 3
# You can also specify column names as if they are dict keys
dm['fibonacci'] = 0, 1, 1, 2, 3
# Remove 0 and 3 with a simple selection
dm = (dm.fibonacci > 0) & (dm.fibonacci < 3)
# Get a list of indices that match certain criteria
print(dm[(dm.fibonacci > 0) & (dm.fibonacci < 3)])
# Select 1, 1, and 2 by matching any of the values in a set
dm = dm.fibonacci == {1, 2}
# Select all odd numbers with a lambda expression
dm = dm.fibonacci == (lambda x: x % 2)
# Change all 1s to -1
dm.fibonacci[dm.fibonacci == 1] = -1
# The first two cells from the fibonacci column
print(dm.fibonacci[:2])
# Column mean
print('Mean: %s' % dm.fibonacci.mean)
# Multiply all fibonacci cells by 2
dm.fibonacci_times_two = dm.fibonacci * 2
# Loop through all rows
for row in dm:
print(row.fibonacci) # get the fibonacci cell from the row
# Loop through all columns
for colname, col in dm.columns:
for cell in col: # Loop through all cells in the column
print(cell) # do something with the cell
# Or just see which columns exist
print(dm.column_names)
Important note: Because of a limitation (or feature, if you will) of the Python language, the behavior of and
, or
, and chained (x < y < z
) comparisons cannot be modified. These therefore do not work with DataMatrix
objects as you would expect them to:
# INCORRECT: The following does *not* work as expected
dm = dm.fibonacci > 0 and dm.fibonacci < 3
# INCORRECT: The following does *not* work as expected
dm = 0 < dm.fibonacci < 3
# CORRECT: Use the '&' operator
dm = (dm.fibonacci > 0) & (dm.fibonacci < 3)
Slightly longer cheat sheet:
- Basic operations
- Creating a DataMatrix
- Concatenating two DataMatrix objects
- Creating columns
- Renaming columns
- Deleting columns
- Slicing and assigning to column cells
- Column properties
- Iterating over rows, columns, and cells
- Selecting data
- Comparing a column to a value
- Selecting by multiple criteria with | (or), & (and), and ^ (xor)
- Selecting by multiple criteria by comparing to a set {}
- Selecting with a function or lambda expression
- Selecting values that match another column (or sequence)
- Getting indices for rows that match selection criteria ('where')
- Element-wise column operations
- Reading and writing files
- Column types
- Working with continuous data (requires numpy)
Basic operations
Creating a DataMatrix
Create a new DataMatrix
object, and add a column (named col
). By default, the column is of the MixedColumn
type, which can store numeric and string data.
import sys
from datamatrix import DataMatrix, __version__
dm = DataMatrix(length=2)
dm.col = ':-)'
print(
'Examples generated with DataMatrix v{} on Python {}\n'.format(
__version__,
sys.version
)
)
print(dm)
Output:
Examples generated with DataMatrix v1.0.0a5 on Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]
+---+-----+
| # | col |
+---+-----+
| 0 | :-) |
| 1 | :-) |
+---+-----+
You can change the length of the DataMatrix
later on. If you reduce the length, data will be lost. If you increase the length, empty cells will be added.
dm.length = 3
Concatenating two DataMatrix objects
You can concatenate two DataMatrix
objects using the <<
operator. Matching columns will be combined. (Note that row 2 is empty. This is because we have increased the length of dm
in the previous step, causing an empty row to be added.)
dm2 = DataMatrix(length=2)
dm2.col = ';-)'
dm2.col2 = 10, 20
dm3 = dm << dm2
print(dm3)
Output:
+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 | :-) | |
| 1 | :-) | |
| 2 | | |
| 3 | ;-) | 10 |
| 4 | ;-) | 20 |
+---+-----+------+
Creating columns
You can change all cells in column to a single value. This creates a new column if it doesn't exist yet.
dm.col = 'Another value'
print(dm)
Output:
+---+---------------+
| # | col |
+---+---------------+
| 0 | Another value |
| 1 | Another value |
| 2 | Another value |
+---+---------------+
You can change all cells in a column based on a sequence. This creates a new column if it doesn't exist yet. This sequence must have the same length as the column (3 in this case).
dm.col = 1, 2, 3
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
+---+-----+
If you do not know the name of a column, for example because it is defined by a variable, you can also refer to columns as though they are items of a dict
. However, this is not recommended, because it makes it less clear whether you are referring to column or a row.
dm['col'] = 'X'
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | X |
| 1 | X |
| 2 | X |
+---+-----+
Renaming columns
dm.rename('col', 'col2')
print(dm)
Output:
+---+------+
| # | col2 |
+---+------+
| 0 | X |
| 1 | X |
| 2 | X |
+---+------+
Deleting columns
You can delete a column using the del
keyword:
dm.col = 'x'
del dm.col2
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | x |
| 1 | x |
| 2 | x |
+---+-----+
Slicing and assigning to column cells
Assign to one cell
dm.col[1] = ':-)'
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | x |
| 1 | :-) |
| 2 | x |
+---+-----+
Assign to multiple cells
This changes row 0 and 2. It is not a slice!
dm.col[0,2] = ':P'
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | :P |
| 1 | :-) |
| 2 | :P |
+---+-----+
Assign to a slice of cells
dm.col[1:] = ':D'
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | :P |
| 1 | :D |
| 2 | :D |
+---+-----+
Assign to cells that match a selection criterion
dm.col[1:] = ':D'
dm.is_happy = 'no'
dm.is_happy[dm.col == ':D'] = 'yes'
print(dm)
Output:
+---+-----+----------+
| # | col | is_happy |
+---+-----+----------+
| 0 | :P | no |
| 1 | :D | yes |
| 2 | :D | yes |
+---+-----+----------+
Column properties
Basic numeric properties, such as the mean, can be accessed directly. Only numeric values are taken into account.
dm.col = 1, 2, 'not a number'
# Numeric descriptives
print('mean: %s' % dm.col.mean)
print('median: %s' % dm.col.median)
print('standard deviation: %s' % dm.col.std)
print('sum: %s' % dm.col.sum)
print('min: %s' % dm.col.min)
print('max: %s' % dm.col.max)
# Other properties
print('unique values: %s' % dm.col.unique)
print('number of unique values: %s' % dm.col.count)
print('column name: %s' % dm.col.name)
Output:
mean: 1.5
median: 1.5
standard deviation: 0.7071067811865476
sum: 3.0
min: 1.0
max: 2.0
unique values: [1, 2, 'not a number']
number of unique values: 3
column name: col
Iterating over rows, columns, and cells
By iterating directly over a DataMatrix
object, you get successive Row
objects. From a Row
object, you can directly access cells.
dm.col = 'a', 'b', 'c'
for row in dm:
print(row)
print(row.col)
Output:
+----------+-------+
| Name | Value |
+----------+-------+
| col | a |
| is_happy | no |
+----------+-------+
a
+----------+-------+
| Name | Value |
+----------+-------+
| col | b |
| is_happy | yes |
+----------+-------+
b
+----------+-------+
| Name | Value |
+----------+-------+
| col | c |
| is_happy | yes |
+----------+-------+
c
By iterating over DataMatrix.columns
, you get successive (column_name, column)
tuples.
for colname, col in dm.columns:
print('%s = %s' % (colname, col))
Output:
col = col['a', 'b', 'c']
is_happy = col['no', 'yes', 'yes']
By iterating over a column, you get successive cells:
for cell in dm.col:
print(cell)
Output:
a
b
c
By iterating over a Row
object, you get (column_name, cell
) tuples:
row = dm[0] # Get the first row
for colname, cell in row:
print('%s = %s' % (colname, cell))
Output:
col = a
is_happy = no
The column_names
property gives a sorted list of all column names (without the corresponding column objects):
print(dm.column_names)
Output:
['col', 'is_happy']
Selecting data
Comparing a column to a value
You can select by directly comparing columns to values. This returns a new DataMatrix
object with only the selected rows.
dm = DataMatrix(length=10)
dm.col = range(10)
dm_subset = dm.col > 5
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
+---+-----+
Selecting by multiple criteria with |
(or), &
(and), and ^
(xor)
You can select by multiple criteria using the |
(or), &
(and), and ^
(xor) operators (but not the actual words 'and' and 'or'). Note the parentheses, which are necessary because |
, &
, and ^
have priority over other operators.
dm_subset = (dm.col < 1) | (dm.col > 8)
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | 0 |
| 9 | 9 |
+---+-----+
dm_subset = (dm.col > 1) & (dm.col < 8)
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
+---+-----+
Selecting by multiple criteria by comparing to a set {}
If you want to check whether column values are identical to, or different from, a set of test values, you can compare the column to a set
object. (This is considerably faster than comparing the column values to each of the test values separately, and then merging the result using &
or |
.)
dm_subset = dm.col == {1, 3, 5, 7}
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 1 | 1 |
| 3 | 3 |
| 5 | 5 |
| 7 | 7 |
+---+-----+
Selecting with a function or lambda expression
You can also use a function or lambda
expression to select column values. The function must take a single argument and its return value determines whether the column value is selected. This is analogous to the classic filter()
function.
dm_subset = dm.col == (lambda x: x % 2)
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 1 | 1 |
| 3 | 3 |
| 5 | 5 |
| 7 | 7 |
| 9 | 9 |
+---+-----+
Selecting values that match another column (or sequence)
You can also select by comparing a column to a sequence, in which case a row-by-row comparison is done. This requires that the sequence has the same length as the column, is not a set
object (because set
objects are treated as described above).
dm = DataMatrix(length=4)
dm.col = 'a', 'b', 'c', 'd'
dm_subset = dm.col == ['a', 'b', 'x', 'y']
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | a |
| 1 | b |
+---+-----+
When a column contains values of different types, you can also select values by type: (Note: On Python 2, all str
values are automatically decoded to unicode
, so you'd need to compare the column to unicode
to extract str
values.)
dm = DataMatrix(length=4)
dm.col = 'a', 1, 'c', 2
dm_subset = dm.col == int
print(dm_subset)
Output:
+---+-----+
| # | col |
+---+-----+
| 1 | 1 |
| 3 | 2 |
+---+-----+
Getting indices for rows that match selection criteria ('where')
You can get the indices for rows that match certain selection criteria by slicing a DataMatrix
with a subset of itself. This is similar to the numpy.where()
function.
dm = DataMatrix(length=4)
dm.col = 1, 2, 3, 4
print(dm[(dm.col > 1) & (dm.col < 4)])
Output:
[1, 2]
Element-wise column operations
Multiplication, addition, etc.
You can apply basic mathematical operations on all cells in a column simultaneously. Cells with non-numeric values are ignored, except by the +
operator, which then results in concatenation.
dm = DataMatrix(length=3)
dm.col = 0, 'a', 20
dm.col2 = dm.col*.5
dm.col3 = dm.col+10
dm.col4 = dm.col-10
dm.col5 = dm.col/50
print(dm)
Output:
+---+-----+------+------+------+------+
| # | col | col2 | col3 | col4 | col5 |
+---+-----+------+------+------+------+
| 0 | 0 | 0.0 | 10 | -10 | 0.0 |
| 1 | a | a | a10 | a | a |
| 2 | 20 | 10.0 | 30 | 10 | 0.4 |
+---+-----+------+------+------+------+
Applying a function or lambda expression
@
operator is only available in Python 3.5 and later.
You can apply a function or lambda
expression to all cells in a column simultaneously with the @
operator.
dm = DataMatrix(length=3)
dm.col = 0, 1, 2
dm.col2 = dm.col @ (lambda x: x*2)
print(dm)
Output:
+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |
+---+-----+------+
Reading and writing files
You can read and write files with functions from the datamatrix.io
module. The main supported file types are csv
and xlsx
.
from datamatrix import io
dm = DataMatrix(length=3)
dm.col = 1, 2, 3
# Write to disk
io.writetxt(dm, 'my_datamatrix.csv')
io.writexlsx(dm, 'my_datamatrix.xlsx')
# And read it back from disk!
dm = io.readtxt('my_datamatrix.csv')
dm = io.readxlsx('my_datamatrix.xlsx')
Column types
When you create a DataMatrix
, you can indicate a default column type. If you do not specify a default column type, a MixedColumn
is used by default.
from datamatrix import DataMatrix, IntColumn
dm = DataMatrix(length=2, default_col_type=IntColumn)
dm.i = 1, 2 # This is an IntColumn
You can also explicitly indicate the column type when creating a new column:
from datamatrix import FloatColumn
dm.f = FloatColumn
MixedColumn (default)
A MixedColumn
contains text (unicode
in Python 2, str
in Python 3), int
, float
, or None
.
Important notes:
utf-8
encoding is assumed for byte strings- String with numeric values, including
NAN
andINF
, are automatically converted to the most appropriate type - The string 'None' is not converted to the type
None
- Trying to assign a non-supported type results in a
TypeError
from datamatrix import DataMatrix, NAN, INF
dm = DataMatrix(length=12)
dm.datatype = (
'int',
'int (converted)',
'float',
'float (converted)',
'None',
'str',
'float',
'float (converted)',
'float',
'float (converted)',
'float',
'float (converted)',
)
dm.value = (
1,
'1',
1.2,
'1.2',
None,
'None',
NAN,
'nan',
INF,
'inf',
-INF,
'-inf'
)
print(dm)
Output:
+----+-------------------+-------+
| # | datatype | value |
+----+-------------------+-------+
| 0 | int | 1 |
| 1 | int (converted) | 1 |
| 2 | float | 1.2 |
| 3 | float (converted) | 1.2 |
| 4 | None | None |
| 5 | str | None |
| 6 | float | nan |
| 7 | float (converted) | nan |
| 8 | float | INF |
| 9 | float (converted) | INF |
| 10 | float | -inf |
| 11 | float (converted) | -inf |
+----+-------------------+-------+
IntColumn (requires numpy)
The IntColumn
contains only int
values. As of 0.14, the easiest way to create a IntColumn
column is to assign int
to a new column name.
Important notes:
- Trying to assign a value that cannot be converted to an
int
results in aTypeError
- Float values will be rounded down (i.e. the decimals will be lost)
NAN
orINF
values are not supported because these arefloat
from datamatrix import DataMatrix
dm = DataMatrix(length=2)
dm.i = int
dm.i = 1, 2
print(dm)
Output:
+---+---+
| # | i |
+---+---+
| 0 | 1 |
| 1 | 2 |
+---+---+
If you insert non-int
values, they are automatically converted to int
if possible. Decimals are discarded (i.e. values are floored, not rounded):
dm.i = '3', 4.7
print(dm)
Output:
+---+---+
| # | i |
+---+---+
| 0 | 3 |
| 1 | 4 |
+---+---+
If you insert values that cannot converted to int
, a TypeError
is raised:
try:
dm.i = 'x'
except TypeError as e:
print(repr(e))
Output:
TypeError('IntColumn expects integers, not x')
FloatColumn (requires numpy)
The FloatColumn
contains float
, nan
, and inf
values. As of 0.14, the easiest way to create a FloatColumn
column is to assign float
to a new column name.
Important notes:
- Values that are accepted by a
MixedColumn
but cannot be converted to a numeric value becomeNAN
. Examples are non-numeric strings orNone
. - Trying to assign a non-supported type results in a
TypeError
import numpy as np
from datamatrix import DataMatrix, FloatColumn
dm = DataMatrix(length=3)
dm.f = float
dm.f = 1, np.nan, np.inf
print(dm)
Output:
+---+-----+
| # | f |
+---+-----+
| 0 | 1.0 |
| 1 | nan |
| 2 | INF |
+---+-----+
If you insert other values, they are automatically converted if possible.
dm.f = '3.3', 'inf', 'nan'
print(dm)
Output:
+---+-----+
| # | f |
+---+-----+
| 0 | 3.3 |
| 1 | INF |
| 2 | nan |
+---+-----+
If you insert values that cannot be converted to float
, they become nan
.
dm.f = 'x'
print(dm)
Output:
/home/sebastiaan/anaconda3/envs/pydata/lib/python3.10/site-packages/datamatrix/py3compat.py:105: UserWarning: Invalid type for FloatColumn: x
warnings.warn(safe_str(msg), *args)
[32mâ ´[0m Generating...+---+-----+
| # | f |
+---+-----+
| 0 | nan |
| 1 | nan |
| 2 | nan |
+---+-----+
nan
data!
You have to take special care when working with nan
data. In general, nan
is not equal to anything else, not even to itself: nan != nan
. You can see this behavior when selecting data from a FloatColumn
with nan
values in it.
from datamatrix import DataMatrix, FloatColumn
dm = DataMatrix(length=3)
dm.f = FloatColumn
dm.f = 0, np.nan, 1
dm = dm.f == [0, np.nan, 1]
print(dm)
Output:
+---+-----+
| # | f |
+---+-----+
| 0 | 0.0 |
| 2 | 1.0 |
+---+-----+
However, for convenience, you can select all nan
values by comparing a FloatColumn
to a single nan
value:
from datamatrix import DataMatrix, FloatColumn
dm = DataMatrix(length=3)
dm.f = FloatColumn
dm.f = 0, np.nan, 1
print('NaN values')
print(dm.f == np.nan)
print('Non-NaN values')
print(dm.f != np.nan)
Output:
NaN values
+---+-----+
| # | f |
+---+-----+
| 1 | nan |
+---+-----+
Non-NaN values
+---+-----+
| # | f |
+---+-----+
| 0 | 0.0 |
| 2 | 1.0 |
+---+-----+
Working with continuous data (requires numpy)
To work with continous (or time-series) data, datamatrix provides the SeriesColumn
class. In a series column, each cell is itself a series of values.
A more elaborate tutorial on working with time series can be found here:
Mixing two- and three-dimensional data
With column-based or tabular data, every cell is defined by two coordinates: the column name, and the row number; that is, column-based data is two dimensional. But for many kinds of data, two dimensions is not enough.
To illustrate this, let's imagine that you want to store the population of cities over a period of three years. You could do this by simply adding a column for every year, population2008
, population2009
, population2010
:
from datamatrix import DataMatrix
# Not very elegant!
dm = DataMatrix(length=2)
dm.city = 'Marseille', 'Lyon'
dm.population2010 = 850726, 484344
dm.population2009 = 850602, 479803
dm.population2008 = 851420, 474946
print(dm)
Output:
+---+-----------+----------------+----------------+----------------+
| # | city | population2008 | population2009 | population2010 |
+---+-----------+----------------+----------------+----------------+
| 0 | Marseille | 851420 | 850602 | 850726 |
| 1 | Lyon | 474946 | 479803 | 484344 |
+---+-----------+----------------+----------------+----------------+
In this example, this naive approach is still feasible, because there are only three years, so you need only three columns. But imagine that you want to store the year-by-year population over several centuries. You would then end up with hundreds of columns! Not impossible, but not very elegant either.
It would be much more elegant if you could have a single column for the population, and then give this column a third dimension (a depth) so that it can store the population over time. And that's where the SeriesColumn
comes in.
from datamatrix import DataMatrix, SeriesColumn
# Pretty elegant, right?
dm = DataMatrix(length=2)
dm.city = 'Marseille', 'Lyon'
dm.population = SeriesColumn(depth=3)
dm.population[0] = 850726, 850602, 851420 # Marseille
dm.population[1] = 484344, 479803, 474946 # Lyon
dm.year = SeriesColumn(depth=3)
dm.year = 2010, 2009, 2008
print(dm)
Output:
+---+-----------+---------------------------+---------------------+
| # | city | population | year |
+---+-----------+---------------------------+---------------------+
| 0 | Marseille | [850726. 850602. 851420.] | [2010. 2009. 2008.] |
| 1 | Lyon | [484344. 479803. 474946.] | [2010. 2009. 2008.] |
+---+-----------+---------------------------+---------------------+
Basic properties of series
Series columns have the same properties as regular columns: mean
, median
, std
, sum
, min
, and max
. But where these properties are single values for regular columns, they are one-dimensional numpy arrays for series columns.
print(dm.population.mean)
Output:
[667535. 665202.5 663183. ]
Indexing
Accessing
The first dimension of a series dimension refers to the row. So to get the population of Marseille (row 0) over time, you can do:
print(dm.population[0])
Output:
[850726. 850602. 851420.]
The second dimension refers to the depth. So to get the population of both Marseille and Lyon in 2009 (the full slice :
), you can do:
print(dm.population[:, 1])
Output:
col[850602. 479803.]
Assigning
You can assign to a series columns as you would to a 2D numpy array:
dm = DataMatrix(length=2)
dm.s = SeriesColumn(depth=3)
dm.s[0, 0] = 1
dm.s[1:, 1:] = 2
print(dm)
Output:
+---+---------------+
| # | s |
+---+---------------+
| 0 | [ 1. nan nan] |
| 1 | [nan 2. 2.] |
+---+---------------+
If you want to set all cells at once, you can directly assign a single value:
dm.s = 10
print(dm)
Output:
+---+---------------+
| # | s |
+---+---------------+
| 0 | [10. 10. 10.] |
| 1 | [10. 10. 10.] |
+---+---------------+
If you want to set all rows at once, you can directly assign a sequence with a length that is equal to the depth of the series:
dm.s = 100, 200, 300
# Equal to: dm.s[:,:] = 100, 200, 300
print(dm)
Output:
+---+------------------+
| # | s |
+---+------------------+
| 0 | [100. 200. 300.] |
| 1 | [100. 200. 300.] |
+---+------------------+
If you want to set all columns at once, you can directly assing a sequence with a length that is equal to the length of the datamatrix:
dm.s = 1000, 2000
# Equal to: dm.s[:,:] = 1000, 2000
print(dm)
Output:
+---+---------------------+
| # | s |
+---+---------------------+
| 0 | [1000. 1000. 1000.] |
| 1 | [2000. 2000. 2000.] |
+---+---------------------+