fedora-ai-ml
LOGS
<@tflink:fedora.im>
17:31:59
!startmeeting fedora-ai-ml
<@meetbot:fedora.im>
17:31:59
Meeting started at 2024-11-21 17:31:59 UTC
<@meetbot:fedora.im>
17:32:00
The Meeting name is 'fedora-ai-ml'
<@mystro256:fedora.im>
17:32:02
!hi
<@zodbot:fedora.im>
17:32:03
None (mystro256)
<@mystro256:fedora.im>
17:32:12
hello none
<@trix:fedora.im>
17:32:13
!hi
<@zodbot:fedora.im>
17:32:13
Tom Rix (trix)
<@tflink:fedora.im>
17:32:23
ok, that other meeting ended suddenly \o/
<@tflink:fedora.im>
17:32:25
!hi
<@zodbot:fedora.im>
17:32:26
Tim Flink (tflink)
<@tflink:fedora.im>
17:33:12
!info today's agenda (living document): https://board.net/p/fedora-aiml-sig-meeting-agenda
<@tflink:fedora.im>
17:33:35
we have a lot of stuff to cover today so let's get started
<@tflink:fedora.im>
17:33:44
!topic F42 Planning
<@man2dev:fedora.im>
17:33:51
! Hi
<@man2dev:fedora.im>
17:34:02
!Hi
<@mystro256:fedora.im>
17:34:21
I think it's case sensitive
<@tflink:fedora.im>
17:34:21
Mohammadreza Hendiani: it's just lowercase
<@man2dev:fedora.im>
17:34:37
! hi
<@tflink:fedora.im>
17:34:40
F42 is coming sooner than I'd like :)
<@trix:fedora.im>
17:34:43
it doesn't like being yelled at !?!?
<@tflink:fedora.im>
17:34:50
mass rebuild starts on 2025-01-15
<@tflink:fedora.im>
17:35:03
F42 branch is 2025-02-04
<@trix:fedora.im>
17:35:06
i'd like to have everything settled before then
<@tflink:fedora.im>
17:35:26
F42 beta freeze is 2025-02-18
<@tflink:fedora.im>
17:35:58
!info F42 mass rebuild starts on 2025-01-15
<@tflink:fedora.im>
17:36:07
!info F42 branch is 2025-02-04
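As a rough planning aid, the milestone dates mentioned so far can be turned into a countdown from the meeting date. This is a minimal Python sketch, not part of any Fedora tooling: the names `MEETING_DAY`, `F42_MILESTONES` and `days_until` are made up for illustration, and the dates are only the ones stated in this log.

```python
from datetime import date

# Dates exactly as stated in this meeting log; all names here are hypothetical.
MEETING_DAY = date(2024, 11, 21)
F42_MILESTONES = {
    "mass rebuild": date(2025, 1, 15),
    "branch": date(2025, 2, 4),
    "beta freeze": date(2025, 2, 18),
}


def days_until(milestones, today):
    """Return {milestone name: whole days from `today` to that milestone}."""
    return {name: (day - today).days for name, day in milestones.items()}


if __name__ == "__main__":
    for name, remaining in days_until(F42_MILESTONES, MEETING_DAY).items():
        print(f"{name}: {remaining} days")
```

Counted from the meeting date, that works out to 55 days until the mass rebuild, 75 until branch and 89 until beta freeze.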
<@trix:fedora.im>
17:36:14
big things for me will be 6.3 (assuming it's there) and llvm-rocm (working on now)
<@tflink:fedora.im>
17:36:52
do we want to leave llvm-rocm and the pending llvm18 problem for a separate topic?
<@trix:fedora.im>
17:37:07
if you want.
<@tflink:fedora.im>
17:37:16
!info ROCm 6.3 is planned for F42
<@mystro256:fedora.im>
17:37:52
might be able to get 6.4 as an update if it comes in time
<@mystro256:fedora.im>
17:38:17
no idea what the GA date is though
<@trix:fedora.im>
17:38:19
i would rather not do 6.4 if it is close.
<@mystro256:fedora.im>
17:38:31
yeah 6.3 for sure, 6.4 maybe
<@tflink:fedora.im>
17:38:47
Tom Rix: is there a pytorch update planned? I'm not sure when the next release for that is
<@trix:fedora.im>
17:39:31
oh man hard questions.. i am spending my time on the llvm problem, so pytorch has not gotten any luv.
<@tflink:fedora.im>
17:39:40
fair enough
<@trix:fedora.im>
17:40:09
all of rocm will fall over from llvm problem, including pytorch.
<@tflink:fedora.im>
17:40:58
!info F42 pytorch release is as of yet unknown
<@tflink:fedora.im>
17:42:13
so it looks like the planned feature set for F42 is: rocm 6.3 (maybe 6.4 but not planning on it) and maybe SDL3?
<@tflink:fedora.im>
17:42:41
are there other features that folks are planning on?
<@man2dev:fedora.im>
17:43:32
Oh, I wanted to propose the SDL3 thing, but considering that it's not stable yet, I don't think it's a very good idea, but maybe packaging it as. As for the proposal, I couldn't figure out how to submit a wiki page, so I never did it.
<@tflink:fedora.im>
17:44:21
Mohammadreza Hendiani: let us know if you want help with submitting a feature for F42. it sounds like SDL3 may not be ready in time for F42, though
<@man2dev:fedora.im>
17:44:45
Yeah
<@tflink:fedora.im>
17:45:17
cool, let us know if that changes in time for F42 features and someone can help you with the wiki bits as needed
<@trix:fedora.im>
17:45:30
ollama is a possibility.
<@tflink:fedora.im>
17:45:54
!info ollama is possible for F42 but not yet sure if that work will be finished in time
<@trix:fedora.im>
17:46:09
i'd like to have ollama+apu shiny in F42.
<@tflink:fedora.im>
17:46:29
are you planning to include APU support as part of the ROCm 6.3 feature?
<@trix:fedora.im>
17:46:42
yes, apu is already in
<@tflink:fedora.im>
17:46:54
cool
<@trix:fedora.im>
17:47:09
maybe we add whatever new one comes along in 6.3 1152?
<@trix:fedora.im>
17:47:38
we have 1035, 1103 and 1151, 3 generations of laptop APUs.
<@trix:fedora.im>
17:48:25
another feature that is in now is removing the split libs; everything now lives in the usual place, /usr/lib64
<@tflink:fedora.im>
17:48:55
which makes building stuff much less crazy. I'm glad to see that
<@man2dev:fedora.im>
17:48:59
What? 
<@tflink:fedora.im>
17:49:38
the new compression feature in llvm means that the split libs (gfx11, gfx10 etc.) are gone for the rocm packages
<@trix:fedora.im>
17:50:08
yes. that is the gem the amd compiler guys gave us.
<@tflink:fedora.im>
17:50:53
in terms of the writeups for known features, @trix is doing rocm but I think that's the only F42 feature for now
<@trix:fedora.im>
17:51:26
yup, i can do the writeup stuff i just yakked about.
<@tflink:fedora.im>
17:51:54
!info trix will be writing up the F42 change proposal for ROCm
<@tflink:fedora.im>
17:52:15
ok, anything F42 related other than the llvm fun?
<@trix:fedora.im>
17:52:35
testing, but that is related.
<@tflink:fedora.im>
17:52:55
I figured we'd finish up non-llvm F42 planning, talk about llvm and then get to testing
<@trix:fedora.im>
17:53:04
coolio
<@tflink:fedora.im>
17:53:42
!topic rocm and llvm18
<@tflink:fedora.im>
17:54:16
as I understand it, there are two related issues here
<@tflink:fedora.im>
17:54:42
1. F41 was a bit of a disaster and rocm still isn't working w/o updates-testing in F41 due to the late llvm change
<@trix:fedora.im>
17:54:51
yes
<@trix:fedora.im>
17:55:09
s/bit/flaming bag/
<@tflink:fedora.im>
17:55:14
2. ROCm 6.3 will not support llvm19 and llvm18 will be orphaned by the llvm maintainers for F42
<@trix:fedora.im>
17:55:36
yes
<@trix:fedora.im>
17:55:56
llvm20 will come in around F42 beta 2
<@mystro256:fedora.im>
17:55:58
yeah 6.3 is likely going to be llvm 18 based on upstream's feedback
<@mystro256:fedora.im>
17:56:12
6.4 might be 19, but it's unknown
<@tflink:fedora.im>
17:56:22
so, at a minimum, someone will need to take on the llvm18 compat packages once they're orphaned by the llvm folks
<@trix:fedora.im>
17:56:48
that is an option.
<@tflink:fedora.im>
17:56:55
unless we can somehow convince the llvm folks not to orphan them but I wouldn't get your hopes up there
<@tflink:fedora.im>
17:57:12
the other option is to start bundling llvm and stop depending on system llvm
<@trix:fedora.im>
17:57:37
the bundled llvm is what i have been working on.
<@trix:fedora.im>
17:58:00
as a fallback to the first problem.
<@tflink:fedora.im>
17:58:09
I think we'd have to get a waiver from FESCo to bundle llvm like that but I think we have a pretty good case for it
<@trix:fedora.im>
17:58:13
now as a primary to the second problem.
<@tflink:fedora.im>
17:58:38
llvm changes have blown up ROCm for two releases in a row and I don't see that being fixed until FESCo stops accepting llvm version changes so late
<@trix:fedora.im>
17:59:28
on the orphan question, is there someone that wants to pick up llvm18?
<@trix:fedora.im>
17:59:44
i will not.
<@mystro256:fedora.im>
18:00:54
well the problem is what is the advantage of using llvm18 over a fork
<@mystro256:fedora.im>
18:01:09
if llvm18 is abandoned, we might as well build the fork
<@tflink:fedora.im>
18:01:16
technical adherence to policy
<@mystro256:fedora.im>
18:01:29
did we ask fesco?
<@tflink:fedora.im>
18:01:34
not yet, no
<@mystro256:fedora.im>
18:01:42
someone should
<@mystro256:fedora.im>
18:01:56
can someone volunteer opening a ticket?
<@tflink:fedora.im>
18:01:59
I can do that unless someone else wants to
<@mystro256:fedora.im>
18:02:24
I just keep forgetting, so if you can, please
<@tflink:fedora.im>
18:02:46
!action tflink to submit ticket to FESCo about bundling llvm for ROCm
<@tflink:fedora.im>
18:03:27
so I think that we're mostly in a holding pattern on this until the FESCo question is answered
<@tflink:fedora.im>
18:03:34
is there anything else on this topic for today?
<@man2dev:fedora.im>
18:03:46
Yes
<@tflink:fedora.im>
18:03:58
I don't understand
<@tflink:fedora.im>
18:04:16
are you saying that there is more on this topic or agreeing with the fact that we need to submit a ticket to fesco
<@man2dev:fedora.im>
18:04:17
Testing infra: testfarm research result:
<@tflink:fedora.im>
18:04:25
that's not llvm related
<@tflink:fedora.im>
18:04:51
but it is the next topic if there's nothing more on llvm for today
<@man2dev:fedora.im>
18:05:01
Oh, I thought you were talking overall
<@tflink:fedora.im>
18:05:48
!topic HW Testing
<@tflink:fedora.im>
18:06:02
this might get a bit messy, it sounds like there are 3 of us working on this independently
<@tflink:fedora.im>
18:06:07
who wants to go first?
<@man2dev:fedora.im>
18:06:17
Proposal to Enhance Testing Efficiency Using Testing Farm and Associated Tools
<@man2dev:fedora.im>
18:06:17
Testing Farm, primarily supported by AWS infrastructure, provides a robust platform for managing and executing tests with customizable hardware. It is widely used by upstream projects like Systemd and Cockpit to ensure seamless integration and reliable testing workflows. Relevant resources include:
<@man2dev:fedora.im>
18:06:17
- [Testing Farm](https://testing-farm.io)
<@man2dev:fedora.im>
18:06:17
- [Testing Farm GitLab](https://gitlab.com/testing-farm)
<@man2dev:fedora.im>
18:06:17
- [Testing Farm YouTube Guide](https://www.youtube.com/watch?v=F7C82Fwdvis)
<@man2dev:fedora.im>
18:06:25
   - Utilize Testing Farm reservations ([docs](https://gitlab.com/testing-farm)) for experiments, e.g.:
<@man2dev:fedora.im>
18:06:25
     testing-farm reserve --compose Fedora-Rawhide
<@man2dev:fedora.im>
18:06:25
     testing-farm reserve --compose Fedora-Rawhide --hardware virtualization.is-virtualized=false
<@man2dev:fedora.im>
18:06:27
### Integrate:
<@man2dev:fedora.im>
18:06:27
  - Has a wide variety of interfaces (API, testing-farm CLI tool, tmt CLI tool) to integrate testing workflows, preferably for the upstream projects or downstream in the Fedora SRC repo
<@man2dev:fedora.im>
18:06:27
     + TMT ([docs](https://tmt.readthedocs.io/en/stable/)) to manage tests with FMF metadata.
<@man2dev:fedora.im>
18:06:27
     + ([Fedora CI docs](https://docs.fedoraproject.org/en-US/ci))
<@man2dev:fedora.im>
18:06:27
     + ([Fedora CI Matrix](#fedora-ci:fedoraproject.org))
<@man2dev:fedora.im>
18:06:27
   - Automate upstream CI testing through Packit ([docs](https://packit.dev/docs/configuration/upstream/tests)), with results available on the [Packit dashboard](https://dashboard.packit.dev/jobs/testing-farm).
<@man2dev:fedora.im>
18:06:32
   - [Testing Farm status page](https://status.testing-farm.io).
<@man2dev:fedora.im>
18:06:32
   - Collaborate with Testing Farm to request additional resources and ensure tests work for packages like rocm.
<@man2dev:fedora.im>
18:06:32
  ### Currently used by:
<@man2dev:fedora.im>
18:06:32
   systemd and cockpit:
<@man2dev:fedora.im>
18:06:32
   - cockpit: Performs automated testing with FMF files ([example FMF file](https://github.com/cockpit-project/starter-kit/blob/main/test/browser/main.fmf)).
<@man2dev:fedora.im>
18:06:34
### Use case
<@man2dev:fedora.im>
18:06:34
- Expand and standardize workflows across upstream projects like AMD's fork of llvm or rocm ...
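The two reservation commands in the proposal differ only in the hardware requirement, so they can be generated from one helper. The following is a minimal Python sketch that only assembles the command line: it uses nothing beyond the `--compose` and `--hardware` options that appear in the log, the helper name `reserve_command` is hypothetical, and passing one `--hardware` option per requirement is an assumption about the CLI rather than something confirmed here.

```python
import shlex


def reserve_command(compose, hardware=None):
    """Build an argv list for `testing-farm reserve`.

    Only the --compose and --hardware options seen in the meeting log are
    used. `hardware` is a hypothetical dict of requirement name -> value;
    emitting one --hardware option per entry is an assumption.
    """
    argv = ["testing-farm", "reserve", "--compose", compose]
    for key, value in (hardware or {}).items():
        argv += ["--hardware", f"{key}={value}"]
    return argv


if __name__ == "__main__":
    bare_metal = reserve_command(
        "Fedora-Rawhide", {"virtualization.is-virtualized": "false"}
    )
    # Print the command for review rather than running it; actually running
    # it would need the testing-farm CLI and access to the service.
    print(shlex.join(bare_metal))
```

Building the argument list separately from executing it keeps the sketch safe to run anywhere and makes it easy to review the exact command before a machine is reserved.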
<@tflink:fedora.im>
18:06:51
that's a lot of text to dump in a meeting
<@man2dev:fedora.im>
18:07:19
The main resource for anyone wanting to get a good grasp on the topic is the YouTube video https://www.youtube.com/watch?v=F7C82Fwdvis
<@tflink:fedora.im>
18:07:55
unless something significant has changed in the last year, testing farm is not an option for HW specific testing unless we're talking about nvidia in AWS
<@tflink:fedora.im>
18:08:31
I'm trying to wrap my head around what all you're proposing, though
<@man2dev:fedora.im>
18:08:50
It's now integrated into the Fedora CI in some places
<@man2dev:fedora.im>
18:09:56
Sorry, I tried to sum up all the main points and links in one text
<@man2dev:fedora.im>
18:10:33
> <@tflink:fedora.im> unless something significant has changed in the last year, testing farm is not an option for HW specific testing unless we're talking about nvidia in AWS
<@man2dev:fedora.im>
18:10:33
It's now integrated into the Fedora CI in some places
<@tflink:fedora.im>
18:10:40
outside of the "this is what testing farm is" part, it sounds like a general proposal to use testing farm for the testing that we want to do for ai-ml in Fedora?
<@tflink:fedora.im>
18:10:49
or am I missing something?
<@man2dev:fedora.im>
18:11:39
Yes, we can actually use additional resources which are necessary for our use cases.
<@man2dev:fedora.im>
18:12:09
From the ci sig https://matrix.to/#/!cfWVeczGVJbiKSlrwi:fedoraproject.org
<@tflink:fedora.im>
18:12:23
have those features been added recently? last I checked, there was little to no support for HW specific testing in testing farm
<@tflink:fedora.im>
18:12:37
hopes and dreams, yes. production code and systems, not as much
<@man2dev:fedora.im>
18:12:42
And I also researched different ways it can be integrated into our workflow.
<@tflink:fedora.im>
18:13:29
I didn't mean that to sound as disrespectful as it sounds. I know how hard it is to get systems like that working and was just trying to express that it's hard to use those things until the supporting bits are in production
<@trix:fedora.im>
18:13:52
my workflow is manual, i'd like to stop that.
<@trix:fedora.im>
18:14:19
fedora as a project i don't believe has hw testing
<@man2dev:fedora.im>
18:14:20
My main source of information was the youtube video and they seem to indicate that they do provide hardware-specific builds. But maybe that's false
<@tflink:fedora.im>
18:14:28
I'm not against using TF but I still have concerns that it's not anywhere close to supporting our use cases
<@man2dev:fedora.im>
18:15:04
It has an API and a CLI, so it does support a manual workflow
<@trix:fedora.im>
18:15:57
to test, i have to manually build and run a bunch of -test subpackages, then manually test applications like blender and torch.
<@trix:fedora.im>
18:16:23
so the 'testing' for each release is me doing that for a week or two.
<@man2dev:fedora.im>
18:16:54
I haven't tested it. I'm just bringing it up because they indicate that they can accommodate any computational need that the Fedora Project might have, as long as the need is valid and within reason. That's the main reason I thought it might be useful
<@trix:fedora.im>
18:17:51
llvm monkey wrenched that testing in F40 and F41
<@tflink:fedora.im>
18:18:15
I feel like we just set off a chaos grenade and we're starting to talk past each other
<@trix:fedora.im>
18:18:23
yes.
<@man2dev:fedora.im>
18:18:37
There are various ways of triggering the build, for example on each PR in the upstream project
<@trix:fedora.im>
18:18:46
lets pass the stick.. who wants to talk ?
<@man2dev:fedora.im>
18:18:54
If they add packit
<@trix:fedora.im>
18:18:59
to get to something we can use in F42
<@tflink:fedora.im>
18:19:21
we're talking about solutions before we're talking about what is needed
<@tflink:fedora.im>
18:20:04
well, half of the conversation is around solutions
<@trix:fedora.im>
18:20:14
we have a testing gap. we build for 20 gpus, only 7900 gets any testing and that is manual.
<@tflink:fedora.im>
18:22:18
sorry, struggling with summarizing everything for notes
<@tflink:fedora.im>
18:22:49
!info there is a proposal to start using testing farm for Fedora ai-ml testing
<@tflink:fedora.im>
18:24:46
we have less than 10 minutes left, I propose the following: we discuss the needs we have for the next several minutes and leave the discussion of solutions to a later meeting or another venue (matrix or discourse)
<@tflink:fedora.im>
18:25:06
any objections?
<@trix:fedora.im>
18:25:11
nope
<@man2dev:fedora.im>
18:25:27
No
<@tflink:fedora.im>
18:25:41
!info due to lack of time in this meeting, we will discuss solutions in another venue and leave the discussion to what is needed in this meeting
<@tflink:fedora.im>
18:25:54
!topic ai-ml HW testing needs
<@tflink:fedora.im>
18:26:15
!info there is a huge gap between what we're currently building and what sees regular testing
<@tflink:fedora.im>
18:26:48
!info the level of manual testing we currently have is not sustainable and should be automated if at all possible
<@tflink:fedora.im>
18:27:36
as a more fleshed out point: Tom Rix is one of the only people testing bits right now and mostly on one subset of ROCm (gfx1100)
<@tflink:fedora.im>
18:27:55
as I understand it, the wishlist for automated testing is:
<@tflink:fedora.im>
18:28:15
1. run the rocm self tests on packaging changes (including dependencies)
<@tflink:fedora.im>
18:28:31
2. regularly rebuild rocm and the bits that depend on it to find build errors early
<@trix:fedora.im>
18:28:50
2 is already done.
<@tflink:fedora.im>
18:29:00
ah, I should rephrase that
<@tflink:fedora.im>
18:29:27
2. regularly rebuild rocm and the bits that depend on it to find build errors early, in an automated way that doesn't require Tom Rix to do it by hand
<@trix:fedora.im>
18:29:56
https://copr.fedorainfracloud.org/coprs/g/rocm-packagers-sig/RH/
<@tflink:fedora.im>
18:29:58
3. expand our testing matrix to cover at least the most commonly used HW
<@trix:fedora.im>
18:30:22
i set up a copr to include rocm and its whatrequires
<@trix:fedora.im>
18:30:30
hsakmt
<@tflink:fedora.im>
18:31:04
but that still requires you to do the packaging and submission by hand, no? I thought you were looking for an automated setup so that's happening more often. did I misunderstand?
<@trix:fedora.im>
18:31:44
copr auto builds when someone makes a commit, what else is there to do?
<@tflink:fedora.im>
18:32:09
dunno, it sounds like i misunderstood what you were looking for, though :)
<@trix:fedora.im>
18:32:43
it's ok if we want 2.
<@trix:fedora.im>
18:32:55
building stuff is the easy part.
<@tflink:fedora.im>
18:33:02
very true
<@trix:fedora.im>
18:33:21
what hw do we want to test ?
<@tflink:fedora.im>
18:33:49
my thought was to start with gfx1100 since that's the easiest and expand from there
<@tflink:fedora.im>
18:34:21
it'd depend heavily on what HW we can get our hands on
<@trix:fedora.im>
18:35:05
ok, automate gfx1100 first.
<@tflink:fedora.im>
18:35:09
since we're over time and it seems like there is still confusion here, I propose that we move the conversation around what we want to do for testing to discourse
<@trix:fedora.im>
18:35:47
yes.
<@tflink:fedora.im>
18:36:01
for what it's worth, I also have a proposed solution to all of this that I've been working on but that will wait for another day
<@tflink:fedora.im>
18:36:14
!action tflink to start conversation on discourse about testing desires and requirements
<@trix:fedora.im>
18:36:17
no worries, problem will still be here.
<@tflink:fedora.im>
18:37:08
ok, moving on to open floor if there's nothing else on this topic
<@tflink:fedora.im>
18:37:14
!topic open floor
<@tflink:fedora.im>
18:37:25
is there any topic we didn't get to that needs to be discussed today?
<@trix:fedora.im>
18:38:14
i'm good.
<@tflink:fedora.im>
18:38:24
ok, I'll end the meeting for now. we can always schedule something for next week if there is a need
<@tflink:fedora.im>
18:38:32
thanks for coming, everyone. I'll post minutes shortly
<@tflink:fedora.im>
18:38:35
!endmeeting