Filtering XArray Datasets?

From: ijbrews...@alaska.edu (Israel Brewster)
Newsgroups: comp.lang.python
Subject: Filtering XArray Datasets?
Date: Mon, 6 Jun 2022 14:28:41 -0800
Message-ID: <mailman.547.1654554526.20749.python-list@python.org>
 by: Israel Brewster - Mon, 6 Jun 2022 22:28 UTC

I have some large (>100GB) datasets loaded into memory in a two-dimensional (X and Y) NumPy-array-backed XArray dataset. At one point I want to filter the data using a boolean array created by performing a boolean operation on the dataset; that is, I want to filter the dataset for all points with a longitude value greater than, say, 50 and less than 60, just to give an example (hopefully that all makes sense?).

Currently I am doing this by creating a boolean array (data['latitude'] > 50, for example), and then applying that boolean array to the dataset using .where(), with drop=True. This appears to work, but has two issues:

1) It’s slow. On my large datasets, applying where can take several minutes (vs. just seconds to use a boolean array to index a similarly sized numpy array)
2) It uses large amounts of memory (which is REALLY a problem when the array is already using 100GB+)
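
For readers following along, the approach described above can be sketched on a toy grid. This is only an illustration under assumptions: the variable names ("longitude", "latitude", "value"), the dimension names ("y", "x"), and the tiny 4x5 grid are made up here and are not taken from the original dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical toy grid standing in for the >100GB dataset in the post.
ny, nx = 4, 5
lon = np.linspace(40.0, 70.0, nx)   # 40, 47.5, 55, 62.5, 70
lat = np.linspace(50.0, 65.0, ny)   # 50, 55, 60, 65
lon2d, lat2d = np.meshgrid(lon, lat)  # both have shape (ny, nx)

data = xr.Dataset(
    {
        "longitude": (("y", "x"), lon2d),
        "latitude": (("y", "x"), lat2d),
        "value": (("y", "x"), np.arange(ny * nx, dtype=float).reshape(ny, nx)),
    }
)

# Boolean array built from the dataset itself (the range from the example).
mask = (data["longitude"] > 50) & (data["longitude"] < 60)

# The approach from the post: mask the dataset and drop rows/columns
# along which the mask is False everywhere. This builds a new object.
filtered = data.where(mask, drop=True)

# The plain-NumPy comparison mentioned in issue 1: boolean indexing of the
# underlying array. Note that it returns a flattened 1-D result.
flat = data["value"].values[mask.values]
```

On this grid only the column with longitude 55 survives, so `filtered` keeps all four rows and one column, while `flat` holds the same four values as a 1-D array.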

What it looks like is that values corresponding to True in the boolean array are copied to a new XArray object, thereby potentially doubling memory usage until it is complete, at which point the original object can be dropped, thereby freeing the memory.
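
The copying behaviour suspected above matches how NumPy itself treats boolean indexing, which may be part of what is happening underneath. A small sketch, using a made-up 4x5 array, shows that selecting with a boolean mask allocates new memory, whereas a rectangular slice is a view onto the original buffer. Whether a slice-based selection is usable depends on the matching points forming a contiguous block, which is an assumption that may not hold for the real data.

```python
import numpy as np

# Made-up array for illustration only.
arr = np.arange(20.0).reshape(4, 5)

# Boolean-mask selection: NumPy copies the selected values into a new array.
mask = (arr > 5) & (arr < 12)
picked = arr[mask]

# Basic slicing: no copy, the result shares memory with the original.
window = arr[1:3, 1:4]

mask_shares = np.shares_memory(arr, picked)    # False: independent copy
slice_shares = np.shares_memory(arr, window)   # True: view on arr

# Writing through the view changes the original array.
window[0, 0] = -1.0
```

So memory use grows with the number of selected points in the masked case, but not in the sliced case.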

Is there any solution for these issues? Some way to do an in-place filtering?
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

Re: Filtering XArray Datasets?

From: wlfr...@ix.netcom.com (Dennis Lee Bieber)
Newsgroups: comp.lang.python
Subject: Re: Filtering XArray Datasets?
Date: Mon, 06 Jun 2022 23:29:02 -0400
Organization: IISS Elusive Unicorn
Message-ID: <brgt9hh9lqdd7g6vupi1fbankh59ccihs6@4ax.com>
References: <54352CCC-E1A5-4AF7-9E19-6B538A13A459@alaska.edu> <mailman.547.1654554526.20749.python-list@python.org>
 by: Dennis Lee Bieber - Tue, 7 Jun 2022 03:29 UTC

On Mon, 6 Jun 2022 14:28:41 -0800, Israel Brewster <ijbrewster@alaska.edu>
declaimed the following:

>I have some large (>100GB) datasets loaded into memory in a two-dimensional (X and Y) NumPy array backed

Unless you have some massive number cruncher machine, with TB RAM, you
are running with a lot of page swap -- and not just cached pages in unused
RAM; actual disk I/O.

Pretty much anything that has to scan the data is going to be slow!

>
>Currently I am doing this by creating a boolean array (data['latitude'] > 50, for example), and then applying that boolean array to the dataset using .where(), with drop=True. This appears to work, but has two issues:
>

FYI: your first paragraph said "longitude", not "latitude".

>1) It’s slow. On my large datasets, applying where can take several minutes (vs. just seconds to use a boolean array to index a similarly sized numpy array)
>2) It uses large amounts of memory (which is REALLY a problem when the array is already using 100GB+)
>

Personally, given the size of the data, and that it is going to involve
lots of page swapping... I'd try to convert the datasets into some RDBMS --
maybe with indices defined for latitude/longitude columns, allowing queries
to scan the index to find matching records, and return those (perhaps for
processing one at a time with "for rec in cursor:" rather than doing a
.fetchall()).

Some RDBMSs even have extensions for spatial data handling.
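
As a rough sketch of the database idea, here is what it could look like with SQLite from the Python standard library. The table name, column names, and sample rows are all hypothetical, and an in-memory database is used only to keep the example self-contained; a real conversion would write to a file on disk and would need to be benchmarked against the in-memory approach.

```python
import sqlite3

# In-memory database for illustration; a real dataset would live on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE points ("
    "id INTEGER PRIMARY KEY, latitude REAL, longitude REAL, value REAL)"
)
# Indices on the coordinate columns so range queries can use them.
conn.execute("CREATE INDEX idx_points_longitude ON points (longitude)")
conn.execute("CREATE INDEX idx_points_latitude ON points (latitude)")

# Hypothetical sample rows: a 4x5 grid of (latitude, longitude, value).
lats = [50.0, 55.0, 60.0, 65.0]
lons = [40.0, 47.5, 55.0, 62.5, 70.0]
rows = [
    (lat, lon, float(i * len(lons) + j))
    for i, lat in enumerate(lats)
    for j, lon in enumerate(lons)
]
conn.executemany(
    "INSERT INTO points (latitude, longitude, value) VALUES (?, ?, ?)", rows
)
conn.commit()

# Range query on longitude, processed one record at a time
# instead of loading everything with fetchall().
cursor = conn.execute(
    "SELECT latitude, longitude, value FROM points "
    "WHERE longitude > ? AND longitude < ? ORDER BY id",
    (50, 60),
)
matches = []
for rec in cursor:
    matches.append(rec)

conn.close()
```

Iterating over the cursor keeps only one row in Python at a time, so `matches` is collected here purely so the result can be inspected.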

--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
